Proposal: IntactPack - storing UnixFS data in contiguous blocks for multi-mode Filecoin retrieval #513
Comments
This is primarily viewed from a retrieval perspective, given the strong allergies appearing toward using the plain old IPLD-based protocols. I keep encountering a strong preference for just piece-based retrievals instead. Unfortunately I never got around to doing the benchmarking I was going to do on this with the Project Motion data I had stored live, to compare the same data fetched both ways and quantify what's been called "the IPLD tax" as it pertains to transfer speeds. There also seems to be a growing meme that if the bytes aren't stored contiguously then it's inefficient. Juan tackled this recently, maybe it was at FilDev Iceland, and I broadly agree with him in that your data is almost never stored contiguously anyway, because the various layers of storage abstractions all do funky things with it. But it's a strong meme, and often these things sway users. I also buy the argument that it'd be simpler for a storage provider to just enable piece retrievals. If we insist on them all exposing trustless, or even bitswap, then we make life harder. If the minimum viable SP stack with retrievals is piece retrievals, then we can make it work with this protocol. Trustless is a value-add on top of it.
I'm proposing that the majority of this work is done via a special API in go-car itself, not the standard API, but an API that opts in to seeing the data in this way. With that, you just treat byte sequences derived from UnixFS offsets the same way we treat them today in "untrusted" CARs: if the bytes are invalid for the CID, they're rejected, just as if you uploaded an invalid CAR today. You could make the Filecoin deal for it, but you're not going to get the data back, except via piece retrieval, without verification. That verification can be done either on the server (Boost) when serving this trustless style, or on the client side when retrieving byte ranges via piece retrieval. The CID still exists in the reference layer and its digest must match the bytes in the looked-up range.
And yes, :ack: on the inefficiencies of double-streaming to first calculate the reference layer and then stream in the file bytes. That's a trade-off that should be acknowledged but my argument is that it's worth the cost if you want maximally flexible retrievability.
Maybe worth comparing this to a variant where the data is uploaded / downloaded fully as raw bytes (maybe as file-delineated objects in a CAR?) and the IPLD references can be generated as look-aside metadata by/on the SP, for its ability to also serve that data with full IPLD semantics.
One drawback there is that you can't do small-chunk trustless retrieval with just a piece endpoint. By packing the proof tree at the start you make this all navigable via piece endpoint with range requests. You can fetch the metadata first, verifying that all makes sense, then fetch the bytes - streaming the whole file, and prove it as you go with the tree you already have. Making /piece/ + Range work for IPLD is one of the key pieces in this proposal. Of course that's just me saying that's an ideal goal, given what I see as the likely trajectory of SPs turning off bitswap and trustless given the overhead and just offering piece retrieval unless there's some other forcing mechanism that works. Right now "serves retrievals" is satisfied with piece retrievals and I suspect it'll stay that way unless Retriev or something else offers token incentives for providing data trustless. CommPact gets part way there but essentially jumps out of an IPLD world, even while staying content addressed. It would be a shame to lose the ecosystem of tooling and techniques because that's all we can do.
Thank you for writing this up as an issue @rvagg 👍 Regarding this proposal, I am mentally stuck on the "Why Both?" question. Mixing raw bytes into a CAR file would certainly make things more complex than the pure CID+block sections in the current CAR format. This brings me to ask the same sort of question that motivated the CommPact proposal:
For me the answer to that is about understanding the original shape of data, minimal adoption barrier, contextual awareness, the "Semantic Web" avenue of thinking. It is less about a cross-cutting representation that would be applied to everything. I believe that the most impactful thing we can do is to help make IPLD representation shine in specific use-cases where existing state of the art representations fall short.
I feel like I have not explained the core idea behind CommPact very well: The whole point of it is that it keeps things IPLD friendly without forcing data prep until we figure out why one should go through with it. |
I don't think the fact that a filesystem implementation may repartition data it writes to disc drivers is a substantive justification for producing additional re-encodings of data written to/from that filesystem which force additional copies of the data to be produced. Filesystems may re-partition data in the process of maintaining a large logical store that continuously presents that data to other layers in the way it received it. The filesystem does not leak this abstraction the way IPFS encoding does when that filesystem is networked; the cost of partitioning appears in the place where that cost has a justification, in the filesystem next to the disc driver. All data said to be available is made available "as it appears", and all of these systems maintain a stable API between protocols from network to syscall wrt "ranges of bytes", as there is no need for further consensus or adoption of the encoding structure the filesystem implements. Networked filesystems "network" the layer above the partitioning, rather than below.

IPFS, when conceptualized as an encoding rather than a cryptographic address for an abstract "range of bytes," presents several new standards which have to be adopted and agreed upon related to the encoding, and in the case of files this is merely to produce a semantic representation of "a range of bytes." There's nowhere to implement this where it doesn't materialize as a forced copy of the data, and there's nowhere that this won't show up as an additional overall cost of IPFS over any alternative technology, because it doesn't do anything except produce an IPLD representation of a "range of bytes" which they already have. If you want the merkle tree in order to do incremental verification then you can publish enough information in content discovery to produce any IPFS tree locally from a generic proof, and then you aren't fighting over abstract encodings of what is addressed anywhere but the client, which can produce whatever encoding it prefers.
I wrote this up hastily before the holidays as "Encodingless IPFS" https://hackmd.io/@gozala/content-claims#Encodingless-IPFS
This is intended to be entirely compatible with the CAR format as-is. The "raw bytes" are still CID prefixed; it's just that the block is quite large and the CID won't be linked by anything else in the CAR. As @magik6k pointed out though, current go-car has some size limitations for blocks.
Throwing some thoughts out there I've seen in the process of working on web3.storage.
sidebar: anyone know what the motivation was for encoding lengths & CIDs next to the blocks? Like, if we started from scratch on an indexed CAR, wouldn't it make sense to have a header not for CARv1 streaming, but for CARs with indexes?
I think the original design just came from Jeromy wanting to spew a bunch of blocks into a file and then slurp them back in again, both for filecoin chain data, and then to do the whole deal thing to get a CommP that in some way relates to the IPLD structure; not a whole lot of thought beyond that afaik, the original go-car wasn't very sophisticated other than spew+slurp.
This started as a Notion document but it's kind of hidden there, although I've now published it for public viewing and commenting. I don't want it to get lost as I start to focus on other things. I did intend to code up a version of this in go-car and work up a spec PR, but I'm worried that it'll get lost in the wash, so I'm recording this here for permanence and hopefully a bit more engagement with other folks.
Why Not Both? Packing Content for IPLD vs Piece (IntactPack)
TL;DR
Instead of packing and storing file / blob data like this:
Pack and store it like this:
Background
There currently exist two views, and two separate uses, of Filecoin storage of content: IPLD block focused, and Filecoin as an opaque block device. These exist as lenses that are used to view the nature of data stored on Filecoin, but they also impact the technology choices as we move up the stack.
Vision 1: IPLD block focused
Vision 2: Filecoin is an opaque block-device
The competition between visions of Filecoin as an IPLD-block storage layer vs an opaque byte store is older than the network. Opaque bytes make everything on the Storage Provider side significantly simpler and cheaper. Yet deep IPLD integration provides significant value to the network (making this case is beyond the scope of this document, but it is a strongly held conviction by some).
Planning for rearchitecting miner and markets software takes the cost and complexity view that the IPLD layer is a value-add and should be considered an optional extra, to be added if/when it's needed/demanded. Storing and proving pieces is the critical activity; retrieving pieces is significantly simpler and more efficient (by many definitions of "efficient"), and indexing at the IPLD-block level is complexity that has been persistently difficult to solve and is one of the largest costs and risks for scalability.
Finding “Why Not Both?” Approaches
The ideal for storage providers is a choose-your-complexity model, where the simplest form of the software stack has the lowest complexity and cost, but complexity can be added to increase the value-add to a storage provider’s customers where it is needed.
CommPact is one attempt at describing a world where piece data can also be viewed through an IPLD lens, by re-purposing the piece commitment proving tree over the raw piece data. This path is one form of in-between ("why not both?") approach. In this form, the data is piece-first, in that it is prepared with the perspective of it being stored in a piece, and the IPLD lens can be applied after-the-fact as required (i.e. the IPLD lens is optional).
Why Not Both: IPLD-first
To serve both worlds, we could take an approach where data preparation still involves packing data in a CAR, including (some) UnixFS encoding, but lays out that data in such a way that it can also be retrieved in whole (or large) portions via piece retrieval. Markets software that is aware of this format can serve content both as standard, small, IPLD blocks, and/or as raw pieces. A storage provider who does not want or need to index IPLD blocks can still serve the content to clients that have stored metadata about content location, while a storage provider that wants to value-add with full Bitswap and HTTP retrieval can extract and serve blocks using the minimal metadata included in the packed form. Most importantly, this proposal does not break existing specifications for UnixFS or the CAR format, but it does add some challenges for existing use-cases making assumptions about what they contain.
IntactPack
Input data can be opaque bytes, or it can be file data. In both cases, it is UnixFS encoded to the reference layer; i.e. this “reference layer” is the UnixFS DAG prepared above the raw byte leaves, as if those leaves were also going to be stored in the CAR. This reference layer is packed in a CAR, and then the whole byte blob is added to the CAR as a single (potentially very large) IPLD block, complete with its own CID prefix.
This leaves us with two separate parts to a CAR: the reference layer which has links to blocks that don’t technically exist as sections in the CAR, and the large byte block, which has a CID that is not referenced by anything.
The process can be repeated for any number of blobs in order to fill up a CAR to the required size. Likewise, blobs that are larger than the required size can be split up into smaller chunks and treated as separate blobs (note there are some complexities in this case which require special handling; this is described below).
Standard UnixFS encoding and packing files / blobs into CAR format introduces challenges for Piece retrievals:
Standard UnixFS encoding of a file / blob: Raw byte blocks chunked at bottom layer, reference layer above to contain the leaves in a graph.
Standard UnixFS file / blob packed in order in CAR format.
A full CAR with multiple files / blobs
Retrieving a file / blob using HTTP piece protocol involves:
UnixFS encoding of a file / blob with the reference layer only, leaves are virtual and the UnixFS blocks contain the details of where they exist in the original stream of bytes
UnixFS reference layer blocks packed together in a CAR followed by single raw block containing original (non-chunked) bytes
A full CAR with multiple files / blobs with non-standard format, original files are fully contiguous
Retrieving a file / blob using this modified UnixFS packing requires less accounting (in minimal form) and a simpler process:
Handling Very-large Blobs
When packing within Filecoin sector constraints, CARs must be under the ~32G (or ~64G) threshold to make a storable piece. Data that is larger than this threshold can be handled by splitting the bytes into roughly sector-sized chunks and storing those chunks prefixed by a full, valid reference layer (only) for that chunk.
The first chunk CAR is straightforward to read and understand. Understanding subsequent chunk CARs requires inspecting the initial portion of the reference layer to find the first leaf block’s file offset. The length of the chunk can be either inferred from the raw block’s length or from the reference layer by also inspecting the last block’s offset and length.
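The chunk-splitting arithmetic can be sketched as follows; the names and the fixed byte budget are illustrative, and a real packer would also have to budget for each chunk's reference layer and CAR framing overhead:

```go
package main

import "fmt"

// chunkRange describes where one chunk's bytes sit in the original blob.
// (Illustrative type: a real packer must also account for per-chunk
// reference-layer and CAR framing overhead within the budget.)
type chunkRange struct {
	Offset int64 // file offset of the chunk's first leaf
	Length int64
}

// splitBlob divides a blob of total bytes into chunks of at most budget bytes.
func splitBlob(total, budget int64) []chunkRange {
	var chunks []chunkRange
	for off := int64(0); off < total; off += budget {
		n := budget
		if rem := total - off; rem < n {
			n = rem
		}
		chunks = append(chunks, chunkRange{Offset: off, Length: n})
	}
	return chunks
}

func main() {
	// A 100 GiB blob against a ~32 GiB piece budget yields four chunks.
	for _, c := range splitBlob(100<<30, 32<<30) {
		fmt.Printf("chunk at offset %d, %d bytes\n", c.Offset, c.Length)
	}
}
```

Each resulting chunk's `Offset` is what a reader recovers by inspecting the first leaf block's file offset in that chunk's reference layer.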
Tooling and Protocol Integration
This packing format does not strictly break any existing specifications:
CAR Format Integration
This format does not strictly require changes to the CAR format, however in some cases it is useful to signal that a CAR is packed using this format, with reference layers packed before the full contiguous bytes of a file / blob. The CARv2 format offers a lightweight “capabilities” signalling bitmap in its prefix which can be used to provide a boolean “is IntactPack format” signal to consumers. This minimally just requires a CARv2 header prefixing the CARv1 payload.
It is theoretically possible to determine whether a file is packed in this format by reading its contents, determining that the order of blocks is as expected, and verifying that the raw block chunks match the expected raw leaves contained in the reference layer. However, this requires both a full ingest of the CAR and hashing of many blocks, so it is less optimal than capability signalling.
Storage Provider Integration
A minimum requirement of HTTP piece retrievals is necessary, but a storage provider may optionally index and serve the individual raw leaf IPLD blocks of files within these CARs. Boost can be IntactPack aware by:
Retrieval Tooling Integration
Retrievals can take multiple forms with this format, each of these can be built in to Lassie or other retrieval clients: