Proposal: GraphSync (A) #66
Conversation
These are the current thoughts about GraphSync written down in a single document. This also contains the results from the Deep-Dive session at the Developer Meeting 2018 in Berlin.
I'm having trouble getting a picture of the protocol from this document, even as a starting point. I'm seeing:
Where the "Consumer" for selector type Y "executes" selectors of type Y, puppeting the GraphSync "client". Is that correct? If so, I'd like to be careful to avoid putting too much logic in the "Consumer" as we don't want implementing new selectors to be hard. |
I think you are correct. The point of the Consumer is that the GraphSync part on the Server can be pretty minimal. Implementing new selectors would then mostly happen in the Consumer as the Server has already the basic building blocks implemented. |
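To make that division of labour concrete, here is a minimal sketch in Go of how the split could look. The interface names and the `GetSubDAG` primitive are hypothetical, not an existing API; the point is only that the Server-side building block stays small while selector logic lives in the Consumer:

```go
package graphsync

import (
	"context"

	cid "github.com/ipfs/go-cid"
)

// Client is the minimal building block the Server side has to implement.
// Everything selector-specific lives above it.
type Client interface {
	// GetSubDAG fetches the raw blocks of the sub-DAG rooted at root,
	// at most maxDepth levels deep (0 meaning "no limit").
	GetSubDAG(ctx context.Context, root cid.Cid, maxDepth int) ([][]byte, error)
}

// Consumer implements one selector type (e.g. a UnixFS range read) by
// issuing one or more of these primitive requests and interpreting the
// returned blocks itself.
type Consumer interface {
	Execute(ctx context.Context, client Client) error
}
```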
/cc @ajbouh |
block: message.block
}
}
// Server has only a subset of the requested DAG
Is this saying that we are adding an "I don't have this" response message?
That would be my plan.
@vmx do you have plans around wire format changes? Also, any thoughts towards real world performance of such algorithms? A lot goes into making bitswap both fast, and not wasteful. The duplicate blocks issue is pretty significant, and worth designing solutions that take it into account. For example, in the happy case, we can ask one person for the data, they can tell us what they don't have, and we can then ask others for that data. But that relies on us trusting that the other peer will be honest, and fast. |
@whyrusleeping Currently GraphSync is becoming more of an RPC call thing, not a real Bitswap replacement. Perhaps GraphSync could then be used as a building block. While implementing what I think GraphSync is, I'm getting more and more doubts about whether it is useful. |
@vmx don't get me wrong, I think GraphSync (in some form) will be incredibly useful. The hard part is just figuring out what that looks like. I've been grappling with the latency vs bandwidth waste vs centralization tradeoffs lately, and it's tough. Some tools that I'm thinking might be useful:
|
Jumping in again to wave hands about graph manifests. I brought this up in cursory fashion at this session, but have had some time to marinate, and think it's a concept worth revisiting. For every discrete DAG g one can construct a manifest, which is a second DAG of only block names and links (no content): These manifests are relatively small. If expressed as a set of two lists (one of array-positional links and one of names/hashes) it should be possible to represent many gigs' worth of IPFS DAG content in < 100kb of CBOR. IMHO, the power of IPFS is derived from the dual expression of blocks as both graphs and flat lists. This is also a fault line that shows up in the seam between bitswap and graph sync. I think graph manifests are a missing "primitive" from IPFS. These manifests have a few properties that are nice:
If I want to plan my requests for blocks efficiently, I really want this manifest as soon as possible. Once I have a manifest I can trust, I know a shit tonne of important things:
So this might be a graph-sync thing, but it could also be a structural outgrowth of a bitswap session: establish a trusted graph, then divvy up block requests among the session. If block sizes are also in the manifest, one can match larger blocks to faster peers. The point being, a manifest gives me a primitive to plan my block requests, and makes optimizing request planning a matter of better matching
Downsides:
Both of those downsides can be mitigated by implementing manifests as a protocol, where peers can dynamically generate manifests of arbitrary graphs & subgraphs, which is the only reason I think it should exist at the IPFS layer. Adding in Graph manifests is kinda like turning IPFS into dynamic bittorrent 🤷♂️. |
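For illustration, a manifest along those lines could be expressed as two flat lists. This is only a sketch with hypothetical Go types (the actual spike lives in go-ipld-manifest, linked further down in the thread):

```go
package manifest

import cid "github.com/ipfs/go-cid"

// Manifest describes the shape of a DAG without carrying any block content.
type Manifest struct {
	// Nodes lists every block in the DAG; by convention index 0 is the root.
	Nodes []cid.Cid `json:"nodes"`
	// Links are (from, to) pairs of positions into Nodes.
	Links [][2]int `json:"links"`
	// Sizes optionally records each block's size in bytes, index-aligned
	// with Nodes, so a requester can match larger blocks to faster peers.
	Sizes []uint64 `json:"sizes,omitempty"`
}
```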
I wrote this yesterday, before the two new comments from @whyrusleeping and @b5. I'll just keep it as it is and post a follow-up comment on how this all relates to each other.

Definitions
Intro

I finally took the time to code what I had in mind (based on this PR). After tackling "give me the full sub-DAG", I wanted to tackle an obvious candidate for GraphSync: UnixFS v1. I then got deep into a rabbit hole. I thought I'd just execute the UnixFS Engine code on the Server, so I don't have to re-implement it. It would then return all the Nodes it visits, which are exactly the ones needed to perform the same query on the Client. It turned out that such an RPC-like call isn't really useful. It won't serve the purpose of being something that is a better Bitswap: if you already had a subset of that Graph, you'd still get a lot of Nodes you don't actually need. I came to that realisation after reading @whyrusleeping's comment (thanks!). I then thought I'd need to go back to the drawing board and talk to lots of people with more knowledge, as I had really hit a wall and needed to start from scratch.

A better way

Suddenly I had my own ideas, and after a bit of thinking I found a way to move forward which aligns with the stuff I already have: make GraphSync less powerful than I intended and let the application layer deal with the rest. GraphSync will only support getting a full sub-DAG combined with a maximum depth. So if you want to get a single Node, you just use a maximum depth of 1. Let me use UnixFS v1 as an example of how this is still powerful enough.

Getting a full file

The easiest case is if you request the full contents of a file. It's just the full sub-DAG of a specific path without any depth limitation.

Getting only the first few bytes of a file

You wouldn't want to transfer all Nodes of the file, as only a small part is needed. For such a traversal you would need to keep track of the sizes of the Nodes that were already transmitted. That's a lot of logic and out of scope for GraphSync. Instead, UnixFS needs a bit more logic. It could fall back to how things currently work with Bitswap and request one block after another. Or it could be smarter and e.g. request all children of a certain Node, which would be a request with a maximum depth of 2. It could then inspect those Nodes and do subsequent requests, e.g. for full sub-DAGs from some Nodes without a maximum depth limitation.

Getting a slice of a file

This case is about getting only a few bytes combined with a certain offset. It works similarly to the case above, just with the offset added.

Getting another slice of the same file

So far the cases would've worked just as well with the approach described in the intro: doing a UnixFS traversal on the Server and transmitting all visited Nodes. But this case is more interesting. If you want a slice of a file you previously got another slice from, you may already have some of the Nodes stored locally. It would be a waste to request all of those again from the Server. The current system handles traversals where some Nodes are missing well: thanks to Bitswap, it will get those missing Nodes from the network. GraphSync can't be used in such a transparent way, as more context is needed (you could use GraphSync like Bitswap by always requesting with a maximum depth of 1, but that wouldn't improve anything). The traversal would signal that the requested Node is not available locally, and then you can decide what to do. It could be that you request the full sub-DAG, or perhaps only the direct children. It's up to the current context and the traversal that is going on which is best suited.

If such a signal for a missing Node is provided by the traversal, it can be re-used for partial GraphSync replies. If you request a full sub-DAG, it could well be that the Server has only a subset of the data. The logic already in place could then deal with such conditions.

Outro

There are still a lot of open questions around how to process the incoming Nodes from a GraphSync request, but at the moment I think those are just implementation details that can be solved. |
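To make the "full sub-DAG plus maximum depth" idea concrete, here is a sketch of what the request/response pair could look like. The field names are hypothetical and this is not a wire-format proposal; the `Missing` list is the "signal for a missing Node" mentioned above, which is what makes partial replies possible:

```go
package graphsync

import cid "github.com/ipfs/go-cid"

// Request asks for the full sub-DAG below Root, at most MaxDepth levels
// deep. MaxDepth == 1 returns just the root Node; MaxDepth == 0 means
// "no depth limit".
type Request struct {
	Root     cid.Cid
	MaxDepth int
}

// Response carries the blocks the Server actually has. Missing lists CIDs
// inside the requested sub-DAG the Server could not provide, so the
// Client's traversal can decide how to fetch them elsewhere.
type Response struct {
	Blocks  [][]byte
	Missing []cid.Cid
}
```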
@whyrusleeping I fully agree that the hard part is what GraphSync should look like. That's exactly what I struggle with. My "better way" addresses the "NACK response" part. It could be extended to a "do you have the data?" request, although I guess if a peer has the data we would want it anyway, so having a "NACK response" would be enough. Or a "would you send me this?" could also be combined with @b5's Graph Manifests and would not only reply with information about a single block, but with the whole sub-DAG this block links to. Provider Hints could be the Graph Manifests. @b5 Thanks for the detailed information on the Graph Manifests. I can see how those could help to optimise the things I described in my "better way". |
Something related that I've been thinking about is creating an abstraction above a Block Store that stores metadata about whether or not the store contains the entire graph linked to in the block. This need came up in a proof-of-concept I wrote for "pushing" a graph called graph-push. Essentially, it exposed both a "shallow" and a "deep" push based on whether or not the service has a block. Pushing this decision to the client was highly problematic: it means the client would have to choose between being either fast/efficient or reliable.
The reason I bring this up is, I don't see how a singular manifest scales well for very large graphs. It means that you either keep a static representation of the graph index for every CID, or you do a fairly expensive query over a simpler index every time you generate the manifest. The manifest could also be incredibly large, which leads me to think about all kinds of performance concerns. You can imagine solving these issues with depth definitions and options, but this starts to get very complicated very fast and is always going to have cases that make any solution more or less optimized (deep vs. shallow graphs, for instance). It may be more flexible to simply be able to say "I contain all the blocks in the graph for this CID" or "I don't know how much of this graph I have." The client should be able to figure out the best way to prioritize getting the graph based on this information. It can traverse down the graph with a peer that has some of the data until it hits a block that peer doesn't have. As it makes its way down the graph and has to find new peers in a very large graph, it will see more peers that have the entire graph and can prioritize those peers. |
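A minimal sketch of such an abstraction, assuming a hypothetical `GraphStore` interface layered on top of an ordinary block store:

```go
package graphstore

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	cid "github.com/ipfs/go-cid"
)

// GraphStore wraps a block store with one extra piece of metadata.
type GraphStore interface {
	// Get returns a single block, exactly like a plain block store.
	Get(ctx context.Context, c cid.Cid) (blocks.Block, error)

	// HasGraph answers the only extra question this layer adds:
	// "do I hold every block reachable from c?". A false return means
	// "unknown or incomplete", not "definitely missing".
	HasGraph(ctx context.Context, c cid.Cid) (bool, error)
}
```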
That's a really good question IMHO: how much could a graph manifest practically hold? If it's not enough info, then it's a bad design choice. Given that @vmx's better way might be able to make use of these manifests, I've coded up a quick spike implementation to get a feel & see if this is worth discussing further:

Example Code

https://github.com/qri-io/go-ipld-manifest

There's a test in there that runs some extremely rough numbers for a 4-tiered DAG, where the first three tiers are small "link-only nodes" and the bottom ~3k nodes are all 256kb blocks. Running that test with go test -v reports:

manifest representing 4043 nodes and 1.024210Gb of content is 253.921997kb as CBOR
So based on this very rough example, you could get around 1 Gig of content represented in a single manifest if stored as CBOR. I'm assuming a manifest should fit in a single block for caching purposes, but that may not necessarily be true. To keep the example "real" (lol) I've added in a list of block sizes to the manifest. Whether that's acceptable is, well, a question for y'all. It's worth noting this total-storable figure will drop with the switch to base32 CIDs.
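As a rough sanity check of those numbers (a back-of-envelope sketch only; it just divides the figures reported by the test and ignores CBOR framing details):

```go
package main

import "fmt"

func main() {
	const (
		nodes      = 4043       // nodes in the test DAG
		manifestKB = 253.921997 // reported CBOR manifest size
		contentGB  = 1.024210   // reported content size
	)
	perNode := manifestKB * 1024 / nodes  // manifest bytes per node, roughly 64
	ratio := contentGB * 1e6 / manifestKB // content bytes per manifest byte, roughly 4000
	fmt.Printf("~%.0f bytes of manifest per node, ~%.0f:1 content-to-manifest ratio\n", perNode, ratio)
}
```

At roughly 64 bytes of manifest per node and a ~4000:1 content-to-manifest ratio, the "about 1 Gig per single-block manifest" figure checks out for 256kb leaf blocks; smaller blocks would shrink that ratio proportionally.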
I'm assuming we're operating in a peer-2-peer environment, and I'm having trouble seeing how me (as a peer) having a list of all the blocks I need, before I go get them, isn't worth the trouble. I'm guessing there are details & a good war story here that I'm having trouble getting to b/c of the client/server terminology. As far as I understand, we're trying to figure out a protocol and implementation to retrieve a subgraph of a DAG by providing a CID plus some meta information, which clearly has a connection to bitswap; the question is where to draw the lines between those APIs, and what API GraphSync should expose (which I fully trust @vmx will handle ;) ). I don't think graph manifests solve this problem. I'm proposing that manifests are a missing building block in that process, and that there are other use cases for a graph manifest outside of graph sync (the big one being a proper progress indicator).
There's a third option: only keeping manifests of important CIDs. In the common use cases that means root hashes. No need to keep a manifest of every CID, but being able to generate a manifest of any graph is a useful property. Manifests of immutable content are also immutable, so caching here is a win, but not vital. Being able to generate manifests at the protocol level would alleviate the need for users to see this stuff, and open the door to future work with subgraph manifests. The code example provided isn't usable as a measurement of performance b/c it's not doing any real node resolving. If the network is involved, yes, this will be a very expensive operation that should be avoided entirely IMHO (@mikeal here I think we're in agreement that a peer either having the full graph or not is a vital piece of info for decision making). If the peer has the full graph locally, calculating a manifest should be cheap. How cheap depends on plumbing I'm not super familiar with. Performance could indeed be a reason for not using the concept of a manifest at all, but to me, if we can't generate a fast manifest of a complete graph we have locally, something is wrong.
I have two concerns here:
- This conversation is happening over the network. Network is expensive.
- The logic that drives this is, IMHO, really hard when you put multiple peers speaking concurrently into the mix.
To me the goal of a graph manifest is to get the client/requesting peer out of an information deficit as early as possible in the graph-sync process, allowing the requester to perform coordination duties, and to be able to concoct different strategies for delegating requests to peers in parallel. To me those "coordination duties" are where the graph sync work starts. If others can benefit from having manifests (I know we would), then I think it's a candidate for pushing lower into the stack.
> Graph Store: Boolean CID index on top of Block Store.

To me this is, like, super solid, which I interpret as part of the "just store your graph information in a graph database" school of thought. This has been suggested elsewhere (I think @lgierth is one of its proponents). A graph database / index does sound smarter than one-off manifests, but I think even in that context they can work in tandem: generate a manifest from the graph DB so the requester can update its knowledge of the merkle forest. Sounds like a lot of planning work that's above my pay grade ;). |
@b5 for standard 'wide' graphs, what is the advantage of the graph manifest over simply doing a breadth first search over the dag? |
Looks like awesome work! Datasets like ImageNet have 10^6 entries (image files) in a single directory. IPFS really falls down when trying to handle scenarios like this. In the abstract, a manifest sounds like a good solution. Though it certainly won't fit in a single block!
|
locally or over the network? Locally the advantage is very little if any. To me the advantage shows up over the network, giving a requesting peer a small payload of trustable knowledge of what they're after. I think they'd make a great extension when kicking off a bitswap session. For any DAG with less than some threshold of blocks, a manifest would be overkill, and should be skipped. |
@b5 I'm talking about over the network. Say I'm fetching a really large file. If I use a selector to fetch the first three layers of the graph, it should give me quite a few hashes to request further, in a trustable way, without being too much data. |
Also potentially relevant for some, an issue I wrote up on selectors a while back: ipfs/notes#272 (comment) |
@whyrusleeping using the first example from your selector thoughts:
One approach would be to optionally return a manifest of H, or at least the hash-of-manifest of H if the peer has a manifest on offer. A peer could elect not to compute a manifest for a number of reasons, so it should be optional. In this context, the manifest is the "quite a few hashes to request further" without being too much data. It's "trustable" in the DHT sense, where manifests should probs be vetted against multiple responding peers or something. If you do end up with a trustable manifest, you can now construct selector-like queries locally & just ask for blocks, because you have the entire graph, just not the content. You don't know which peer has which blocks, but that's less relevant than knowing what blocks you need. Recursive fetching strategies that hone in on outstanding blocks become a thing, which should cut down on complex selector construction & fulfillment, and parallelize across peers better. |
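A sketch of the "vetted against multiple responding peers" step, assuming a hypothetical helper that does a simple majority vote over the manifest CIDs peers report (any other quorum rule would work just as well):

```go
package manifestvet

import cid "github.com/ipfs/go-cid"

// pickTrustedManifest tallies the manifest CIDs reported by the peers we
// asked (keyed by peer ID string) and returns the most common one, plus
// whether a strict majority of peers agreed on it.
func pickTrustedManifest(reported map[string]cid.Cid) (cid.Cid, bool) {
	counts := map[cid.Cid]int{}
	for _, m := range reported {
		counts[m]++
	}
	var best cid.Cid
	bestN := 0
	for m, n := range counts {
		if n > bestN {
			best, bestN = m, n
		}
	}
	return best, bestN*2 > len(reported)
}
```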
Any solution we go with here is going to be more optimized for one case vs. another. That said, I don't think we should be assuming optimally sized ~250K blocks as our go-to use case. Optimal file chunking for large binary files like media would be based on keyframe windows, and with text files we probably want to use a rabin chunker for better updates, which will result in many blocks of a much smaller size. I think we need a better idea of what use cases we're trying to optimize for. I can't think of a use case for large structured data where a manifest is not prohibitively expensive. As a general rule, the more structured the data, the larger the indexes are, and a manifest is effectively an index.
Couple notes here. Whether or not a peer has the full graph is a single bit, we could just stick it in the DHT and let the client use it when prioritizing peer selection. Being that network is expensive, I don't see why we'd want clients to pull down the entire manifest when they may only want a portion of the graph. |
You make good points about needing to outline the use cases we're targeting. Let me ask some silly questions: Without a manifest of some kind, how will someone know what entries they want? Are we assuming that IPFS should always rely on out of band coordination for distribution of CIDs? This out of band bit seems like the implicit assumption in most of IPFS's design. I believe it is a source of many surprising (and disappointing) performance characteristics. |
Reading through this again, I'm starting to see some big holes in this approach.
I don't quite see how we're going to securely and efficiently put this much logic on the "server" side of the transaction. It's a nice idea in theory to just have one end of the connection start sending blocks without the need for another request, but this opts us out of any opportunity to not send blocks one side already has, and the client can't really be responsible for parallelizing across multiple peers if it isn't responsible for the traversal of the graph. Similarly, I don't see how a client could make use of a manifest. There's no guarantee that the peer isn't lying about the manifest, although you could detect inconsistencies as you parse the blocks and go from there. Other peers could make use of a client's manifest when sending blocks back, but this still isn't sufficient, because the client's block store can contain several trees: it could have a sub-tree but be missing the link between that sub-tree and the root of this particular tree, so the sub-tree wouldn't have appeared in the manifest for that root even though it was probably in another. This is going to happen a lot in static site deployments; people have lots of similar shared assets across sites, and there are changes to those assets in subtrees all the time. |
One more thing, can we assume fully duplexed connections are available? If so, there are ways that we can optimize performance by concurrently asking for blocks rather than trying to come up with ways for one end to send many requested blocks serially. |
Ok, might be worth backing up to make this a little clearer with a story. First, the selector conversation is separate from graph manifests. For the sake of argument, let's put selectors aside for a second and walk through an example of how this might work.

First, I add some content to IPFS, which generates the classic DAG and CID

Later on peer Sandra comes along and asks me for the content at CID

Sandra's been asking a few others for hash of

Em then connects and asks both me and Sandra for CID A. This time we both populate the manifest field with

At this point Em has a complete list of every block in CID

Before Em does anything else, Em does a set intersection between their local blocks and the blocks listed in the manifest. Turns out Em already has 15 of the 70 blocks listed in the manifest, so they can skip asking for those. Em wants the whole DAG, so they do the easy thing & just cut the remaining list of 55 blocks in half, asking me for one half and Sandra for the other. Sandra's quicker than me and finishes her list first, so Em cuts my remaining list in half and gives the other list of blocks to Sandra again to fulfill, letting my weak-sauce tethered 3G connection close out the 4 blocks I can contribute.

While this is happening Em is seeing a progress bar, because they know exactly how many blocks are left, which they have, and which they need. One day in future versions of IPFS, Em might use that information to construct fancy selectors that carve up the manifest, asking for a subgraph of available content. If the manifest came back with, say, a larger size than Em's allowed repo, Em may elect to abort the process entirely before asking for more blocks.

While blocks are transmitting, Em is doing the usual checking of the blocks coming over the wire. If at any point the blocks Em's requesting aren't adding up to correct hashes, the whole process can be aborted. In this example Em's local 15 blocks happen to be a subgraph that adds up to a file

Peers are incentivized not to lie about manifests, because if a peer ever transmits a malicious manifest and you acquire the real manifest, you know they're misbehaving: there's a deterministic algorithm connecting the content to the manifest. Because you can generate the manifest locally once you have the full DAG, you can check for malicious responses after the fact.

Ideally, all of this is pretty low level, and structured as an opt-in speed-up-happy-path, falling back to the way things work today (because it works!). Finally, it's worth pointing out this approach is chunking-strategy agnostic: graph manifests will work on any DAG.

To me, selectors enter the conversation after manifests. Manifests by no means answer all the questions you would want to ask of a DAG, but a manifest makes constructing those selector queries simpler and faster. As @vmx mentioned, something akin to manifests would be something graph sync builds upon. I think @ajbouh hit the nail on the head with this:
I'd be happy to outline how I plan to use graph manifests out in IPFS userland, but would rather avoid clogging all y'all's inboxes if we don't have clarity on the concept 😄. |
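The coordination step in that story boils down to a set intersection followed by dealing the remaining wants out to peers. A sketch, with hypothetical helper names and a naive round-robin split (the reassign-work-when-a-peer-finishes part is left out):

```go
package fetchplan

import cid "github.com/ipfs/go-cid"

// planWants drops blocks we already hold locally and deals the rest out
// to the given peers in round-robin order.
func planWants(manifest []cid.Cid, have map[cid.Cid]bool, peers []string) map[string][]cid.Cid {
	plan := make(map[string][]cid.Cid, len(peers))
	if len(peers) == 0 {
		return plan
	}
	i := 0
	for _, c := range manifest {
		if have[c] {
			continue // e.g. the 15 blocks Em already had from another DAG
		}
		p := peers[i%len(peers)]
		plan[p] = append(plan[p], c)
		i++
	}
	return plan
}
```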
@b5 Hrm... I'm still not seeing how much the manifest improves on the situation. For the 15MB file example, you end up with a 1 deep graph, where the root node has links to all the leaf nodes. So the root 'A' of that file contains all the information that the manifest would. Then, at some point the graph gets too big for the manifest file to be represented as a single object, so you would have to shard it. This runs into the same issue as before... If I could have a selector that said "Give me all non-leaf nodes in graph A" it would not be too much more data than the proposed manifest, and actually contain data that we need for the graph. |
@b5 looks interesting, though I can't dig into it in-depth. Could you try to build a manifest over one of my datasets, see how that behaves? ( yes, I still need to clean up the go-ipfs patch to render the metadata locked in this set, $real-world is really messing with my available time ) |
Good to share here a video that just got uploaded, Volker's talk on GraphSync from LabDay. |
@jbenet and @whyrusleeping produced a specification for GraphSync and IPLD Selectors during the Go IPFS Hack Week. It contains all the thinking for these two systems from the last 3 years + thinking about this (first record was Jeromy's Bitswap Talk, circa Dec 2015). You can watch Juan's presentation on the GraphSync and IPLD Selectors Spec here |
@jbenet can you provide the docs produced ASAP? I believe that @vmx and @mikeal are still working on the direction that came out of their recent discussions vs leveraging the spec you produced. @vmx @mikeal one of the valuable outputs of the discussions in Glasgow is that, independently of who is right when it comes to GraphSync design, any GraphSync design and implementation will have to go through a series of tests/benchmarks with multiple graph topologies. Can you list those here? AFAIK we at least have:
@hannahhoward I believe you are working on benchmarks for a potential GraphSync for go-ipfs, do you have a list of topologies you are about to test for? |
That latest set of docs for IPLD Selectors should also be linked to on ipfs/notes#272 :) |
Looking at Juan's screen in his talk, nothing in or linked to on this page matches what is up on his screen :( |
This is a good starter list. Once we have the benchmarks somewhere we can always add data sets; I'd rather just get a few of these going and iterate than try to front-load a ton of work when we're currently operating with zero benchmarks. The much harder part of this will be multiplying the data sets with peer/network conditions. For each of these data sets we need to benchmark situations in which:
The issue with the old design wasn't so much that it didn't work well under a specific data-set but that it completely broke down once you were getting the set from multiple peers. |
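One way to organize that matrix is a table-driven setup along these lines (a sketch with made-up names; the concrete data sets and peer conditions would come from the lists above):

```go
package graphsyncbench

// availability captures who holds the data in a given scenario.
type availability int

const (
	onePeerAll       availability = iota // a single peer has the whole DAG
	manyPeersAll                         // several peers each have the whole DAG
	manyPeersPartial                     // blocks spread across peers, none complete
)

// scenario pairs a data-set shape with a peer/availability condition.
type scenario struct {
	name    string
	dataset string // e.g. "unixfs-large-file", "deep-narrow", "shallow-wide"
	avail   availability
}

var scenarios = []scenario{
	{"large file, single seed", "unixfs-large-file", onePeerAll},
	{"large file, many full seeds", "unixfs-large-file", manyPeersAll},
	{"large file, scattered blocks", "unixfs-large-file", manyPeersPartial},
	{"deep DAG, scattered blocks", "deep-narrow", manyPeersPartial},
}
```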
@mikeal you might be interested in the tests i wrote in go-bitswap recently: https://github.com/ipfs/go-bitswap/pull/8/files |
Hey folks, sorry for delay. I’ll put the docs we made in Glasgow up in the next day |
|
Other notes about the manifests approach discussed here:
And thoughts on provable versions of these. (not relevant for the short term -- <1yr)
|
Thanks for the info. This is super helpful! The whole reason for making a stink comes from pain points we've uncovered building user experiences on top of IPFS: I want to show our users meaningful progress bars when fetching a DAG. That's it. It's a small point, but an extremely crucial one. Unless I'm missing something, IPFS peers lack the info needed to show how many blocks remain and that they're arriving in parallel. Not being able to show "bittorrent style" progress bars means we can't build UI that shows users one of the greatest upsides of block-based content addressing: when performing a fetch, there's a chance your node already has some/many of the blocks you need. If you happen to be building, say, a version control system, there's a very high chance you have lots of the necessary blocks already. Nothing else I've seen has this property. It's the detail that made me pick IPFS over dat, and I really want to show it off to the world in a way I think they'll immediately understand. It's absolutely true that most (all?) manifests would be pretty close to the size of "the whole graph minus leaf nodes". The entire manifest is a tax. The advantage of a manifest is not in the size, but in getting a fetching peer out of an information-poor context as soon as possible. The tax should be covered by being able to make smarter choices with that knowledge. Anyway, I'm just after progress bars. Building this sort of thing in userland is, well, tough. As for provable versions of manifests, that's well above my pay grade; I'll happily leave that to y'all 😉. |
@b5 I think we can solve the progress bars problem (especially in your ipld usecase) by adding a small amount of extra metadata in each node that lets us know roughly how many nodes are behind each link. You should actually be able to do this today by simply adding that to your existing datastructures. Does that seem reasonable? (also, we should open a new issue for 'progress bars on ipld' or something) |
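A sketch of how that per-link metadata could drive a progress bar, assuming a hypothetical `TotalNodes` count carried on each link (dag-pb links already carry a cumulative byte size, `Tsize`, in a similar spirit):

```go
package progress

// Link is a link annotated with how much work sits behind it.
type Link struct {
	Name       string
	TotalNodes uint64 // nodes reachable through this link, including the target
}

// Fraction reports fetched/total as soon as the root node is known, since
// the root's links already sum up the size of the whole DAG.
func Fraction(fetched uint64, rootLinks []Link) float64 {
	total := uint64(1) // count the root itself
	for _, l := range rootLinks {
		total += l.TotalNodes
	}
	return float64(fetched) / float64(total)
}
```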
Totally. Apologies all (particularly @vmx), I've hijacked this thread for long enough. I'd be happy to end the manifest discussion here and move the progress-bar chat to a new issue. Thanks all! |
Here is a playground for you -- ipfs/interop#44 (comment). Customizable file-exchange tests between JS and Go (go<->go, go<->js, js<->js) that cover large files (as large as you want) and directories (as nested as you want). It is pretty easy to try it out with different bundles of go-ipfs and js-ipfs; check the Readme https://github.com/ipfs/interop#run-the-tests |
I'm closing this PR. The contents live on in the design history; see #159 for more information. |
These are the current thoughts about GraphSync written down in a single document. This also contains the results from the Deep-Dive session at the Developer Meeting 2018 in Berlin.
This document should be seen as a starting point, not as a complete, ready to merge thing.
/cc @b5 @diasdavid @jbenet @mib-kd743naq @Stebalien