RFC: js-ipfs Garbage Collection #2012
@Stebalien @magik6k could I ask for your input on the above proposal, and to correct anything I misunderstood about how go GC works? @alanshaw suggested using a tiered datastore. It would track
Please CC anyone else who you think might have knowledge of this area |
Overall, that looks like a good proposal to me. Note:
That may not scale well. We can make this approximate ("accessed within the last day") to reduce the overhead of repeated access but this will still turn reads into writes.
Ideally, this would use a combination of frequency and recency. |
@Stebalien I was thinking that to minimize overhead I wouldn't maintain reference counts for blocks that are not pinned and not in MFS. Because
You're right, scaling will be a challenge, even just for reference counts. A couple of ideas off the top of my head:
|
We should ensure that a block sticks around for an hour at minimum even if not accessed. This allows us to manage expectations about how long data will persist on a preload node/gateway. Access time would really help us clean up blocks more intelligently; ideally we wouldn't collect a block that is often accessed but not pinned. However, storing creation time instead would at least allow us to clean up blocks that are not pinned and older than an hour, without writing on read.
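For instance, a minimal sketch of that guard, assuming `pinned` and `createdAt` are tracked per block (e.g. recorded when the block is put):

```js
// Sketch: a minimum-lifetime guard based on a stored creation time, so
// reads never turn into writes. `pinned` and `createdAt` are assumed to
// be recorded elsewhere.
const MIN_LIFETIME_MS = 60 * 60 * 1000 // one hour, as suggested above

function isCollectable ({ pinned, createdAt }, now = Date.now()) {
  if (pinned) return false
  return now - createdAt >= MIN_LIFETIME_MS
}
```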
Remember that it's basically any MFS command for which you'd need to recalculate references, and that when you make changes anywhere in the MFS tree they propagate all the way back to the root.
I'm sure you know, but the datastore is pluggable - not everything is stored as a file (s3, url, in memory). I would really love to have passive GC in JS IPFS. Personally I'd focus on the minimum viable approach (stop the world) first; once you have that working you'll have a better idea of what's possible, and then you can move on to a more advanced algorithm. |
Unless I'm mistaken, we shouldn't need to maintain reference counts for un-pinned blocks regardless. How does passive GC differ from active
True. For reference counts, we could also try some kind of write-ahead log. That should be pretty easy to maintain without too much overhead (one more write per block).
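A sketch of what that log could look like, assuming a hypothetical append-only `log` store: each block write appends one small delta record, and the counts are only folded together lazily (e.g. when GC runs).

```js
// Sketch of a write-ahead log for reference counts: one extra sequential
// write per block write, with lazy compaction. `log.append` and
// `log.read` are hypothetical.
async function recordRef (log, cid, delta) {
  await log.append(JSON.stringify({ cid: cid.toString(), delta }))
}

async function foldRefs (log) {
  const counts = new Map()
  for await (const entry of log.read()) {
    const { cid, delta } = JSON.parse(entry)
    counts.set(cid, (counts.get(cid) || 0) + delta)
  }
  return counts // cids whose count reaches 0 are GC candidates
}
```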
I'm not sure if the datastore has enough information for this. In go, at least, the datastore just deals with blobs of data and knows nothing about IPLD, references, etc.
We should definitely make some configurable grace period but, IMO, we should set this to 0 on our gateways. Users shouldn't expect us to store stuff for them.
We'd probably want both: sparse access time updates (i.e., record an access every hour) and a creation date. We can reduce the frequency of access time updates as the content ages. My rationale is that data is usually either ephemeral or permanent: the longer data sticks around, the longer it's likely to stick around, so we don't want to just remove old data.
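As an illustration, sparse updates could be as simple as skipping the metadata write unless enough time has passed, with the interval growing as the block ages (all names here are hypothetical):

```js
// Sketch: sparse access-time updates. Reads only trigger a metadata
// write occasionally, and the update interval grows with the age of the
// block, matching the "ephemeral or permanent" intuition above.
// `meta` is a hypothetical per-CID store of { createdAt, accessedAt }.
const HOUR = 60 * 60 * 1000

async function recordAccess (meta, cid) {
  const now = Date.now()
  const m = await meta.get(cid)
  // At least hourly for young blocks, backing off as content ages
  const interval = Math.max(HOUR, (now - m.createdAt) / 24)
  if (now - m.accessedAt >= interval) {
    await meta.put(cid, { ...m, accessedAt: now })
  }
}
```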
We can do this lazily on GC.
Yeah, that's probably the best approach. I'd just keep alternatives in mind before over-optimizing that approach. |
You're right, I don't think we need to maintain reference counts for blocks that are not pinned and not in MFS. If they are pinned or in MFS we will need to, because, for example, a block could be part of a file in MFS and also part of another file that's pinned. Passive GC differs from active, manually run GC in that
I think that's a good idea 👍
Sounds like the consensus is to implement stop-the-world mark and sweep first, as we'll need it anyway, and then look at adding a more sophisticated reference counting algorithm that takes into account creation time and last access time. Does that sound right? |
Sorry @Stebalien I just realized what you meant - you're saying that if we're already reference counting we don't need to mark and sweep, we just remove all blocks that are not reference counted, is that right? If that's the case then maybe it does make sense to skip the mark and sweep implementation, and go straight for the more sophisticated implementation with reference counting. Thoughts @alanshaw? |
That's what I meant. However, going the simpler route first is probably still the better solution. On the other hand, that depends on how pressing fixing this is. |
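For reference, a stop-the-world mark and sweep pass can be sketched in a few lines of JS. This is only an illustration with hypothetical helpers (`roots` would be the pin roots plus the MFS root, `getLinks` returns a block's child CIDs via IPLD), not the actual implementation:

```js
// Minimal stop-the-world mark and sweep sketch (illustrative only).
async function gc ({ roots, getLinks, listBlocks, deleteBlock }) {
  // Mark phase: walk the DAG from every root
  const marked = new Set()
  const walk = async (cid) => {
    const key = cid.toString()
    if (marked.has(key)) return
    marked.add(key)
    for (const child of await getLinks(cid)) {
      await walk(child)
    }
  }
  for (const root of roots) {
    await walk(root)
  }

  // Sweep phase: delete everything that wasn't marked
  for await (const cid of listBlocks()) {
    if (!marked.has(cid.toString())) {
      await deleteBlock(cid)
    }
  }
}
```

The "stop the world" part is exactly what the locking discussion below is about: no put or pin may land between the mark and sweep phases.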
I've implemented mark-and-sweep in #2022 but there is currently no locking. go-ipfs uses a special Blockstore that implements the GCLocker interface:

```go
// GCLocker abstracts functionality to lock a blockstore when performing
// garbage-collection operations.
type GCLocker interface {
	// GCLock locks the blockstore for garbage collection. No operations
	// that expect to finish with a pin should occur simultaneously.
	// Reading during GC is safe, and requires no lock.
	GCLock() Unlocker

	// PinLock locks the blockstore for sequences of puts expected to finish
	// with a pin (before GC). Multiple put->pin sequences can write through
	// at the same time, but no GC should happen simultaneously.
	// Reading during Pinning is safe, and requires no lock.
	PinLock() Unlocker

	// GCRequested returns true if GCLock has been called and is waiting to
	// take the lock.
	GCRequested() bool
}
```

@alanshaw @achingbrain do you think it makes sense for us to follow a similar approach? It would mean modifying js-ipfs-repo's interface to add methods for locking |
This doesn't actually have to go on the repo, you just need to make sure to take a read-lock when adding blocks and a write lock when garbage collecting. |
@dirkmc can you flesh out what it might look like and where we'd have to place code to lock/unlock? This sounds sensible on the surface but I don't have the time to dig in right now. Note that MFS has locking; it might be worth checking out how that is implemented. @achingbrain would be able to answer questions you have about it. |
The go implementation uses a read/write mutex composed of:
When adding a file with
As @Stebalien points out, this doesn't necessarily have to be on the blockstore interface; that's just how it's implemented in go. |
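To make the pattern concrete: the #2022 dependency list includes achingbrain/mortice for this, but the shape of the locking can be sketched standalone. This is not mortice's API and all names are illustrative; puts that will end in a pin share the read lock, while GC takes the exclusive write lock, mirroring go's PinLock/GCLock:

```js
// A minimal readers-writer lock sketch. Waiters are served FIFO, so a
// queued GC (writer) is not starved by a stream of new puts (readers).
class RWLock {
  constructor () {
    this.readers = 0
    this.writing = false
    this.waiting = []
  }

  _wake () {
    while (this.waiting.length > 0) {
      const next = this.waiting[0]
      if (next.write) {
        // A writer can only enter when nothing else holds the lock
        if (this.readers > 0 || this.writing) break
        this.writing = true
      } else {
        // Readers can share, but queue behind an active writer
        if (this.writing) break
        this.readers++
      }
      this.waiting.shift()
      next.resolve()
    }
  }

  _acquire (write) {
    return new Promise(resolve => {
      this.waiting.push({ write, resolve })
      this._wake()
    })
  }

  async readLock (fn) {
    await this._acquire(false)
    try { return await fn() } finally { this.readers--; this._wake() }
  }

  async writeLock (fn) {
    await this._acquire(true)
    try { return await fn() } finally { this.writing = false; this._wake() }
  }
}

const lock = new RWLock()

// put -> pin sequences share the lock (like go's PinLock);
// `blockstore`, `block` and `runGC` are illustrative names
await lock.readLock(() => blockstore.put(block))

// GC excludes all of them (like go's GCLock)
await lock.writeLock(() => runGC())
```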
In the future, it might be a good idea to make a more explicit separation between APIs that store data and those that do not. Right now, there are APIs that “happen to store data until there is a GC” and APIs that store data indefinitely (pinned). If instead the APIs that do not pin data also did not store that data at all, you wouldn't be violating any assumptions users might have with smarter GC strategies. As things are now you're painted into a bit of a corner and have limited opportunities for automated GC. |
A very small note for when this is released: warn the OrbitDB people, because they aren't calling
Thanks @satazor. Could you point me to the piece of code where that happens? Note that |
@dirkmc They are using |
We should get #1867 implemented asap. |
@daviddias and @alanshaw regarding your comments in #2022 (comment)
Quick reminder: there are different types of data storage on the web platform. We use Temporary by default to avoid annoying user prompts, at the cost of a theoretical data purge due to the browser's LRU policy. There could be a configuration option to opt in to Persistent storage if the app developer is okay with trading user convenience for persistence guarantees, but that is a separate topic for another time/PR. By default, JS IPFS in a browser context should not display any prompts, but should ship with a smart GC that is aware of how Temporary storage works and ensures GC happens before Origin storage limits are hit, to maximize the lifetime of the IPFS cache.
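For what it's worth, the quota check itself is straightforward with the standard `navigator.storage.estimate()` API; the `runGC` hook and watermark value below are hypothetical:

```js
// Sketch: pre-empt the browser's LRU purge of Temporary storage by
// checking the origin quota and collecting at a watermark.
const WATERMARK = 0.9 // collect at 90% of the origin quota

async function maybeCollect (runGC) {
  if (!(navigator.storage && navigator.storage.estimate)) return
  const { usage, quota } = await navigator.storage.estimate()
  if (quota > 0 && usage / quota >= WATERMARK) {
    await runGC()
  }
}
```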
|
resolves #2012

Depends on
- [x] #2004
- [x] ipfs-inactive/js-ipfs-http-client#992
- [x] ipfs-inactive/interface-js-ipfs-core#462
- [x] achingbrain/mortice#1

TODO:
- [x] Core (mark and sweep)
- [x] CLI
- [x] http interface
- [x] interface-js-ipfs-core tests ipfs-inactive/interface-js-ipfs-core#462
- [x] nodejs-specific tests
- [x] Locking
- [x] Tests for locking
This issue is to discuss how best to implement Garbage Collection in js-ipfs.
The go-ipfs Garbage Collector
We would like to learn from the experience of go-ipfs when building a Garbage Collector for js-ipfs.
Triggers

- `ipfs repo gc`
- `--enable-gc` causes GC to run
  - at `StorageGCWatermark` % of `StorageMax` (90% of 10G by default)
  - every `GCPeriod` (1 hour by default)
  - when a file is added to unixfs (currently disabled)

Algorithm

Source code

Note that `bestEffortRoots` currently only contains the MFS root.

Proposal for a js-ipfs Garbage Collector

Requirements

- `ipfs repo gc`: mark and sweep - remove all unreachable blocks
- `ipfs daemon --enable-gc` causes GC to run at `StorageGCWatermark` % of `StorageMax` (90% of 10G by default)

Algorithm for passive GC
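As a rough sketch of how the thread's ideas could combine into a passive pass (reference counts covering pinned/MFS blocks, plus a creation-time grace period; every helper here is hypothetical):

```js
// Rough sketch of a passive GC pass: blocks referenced by pins or MFS
// are kept via reference counts, everything else is collected once it
// outlives a grace period. `listBlocks`, `refCount` and `deleteBlock`
// are hypothetical helpers.
const GRACE_PERIOD_MS = 60 * 60 * 1000

async function passiveGC ({ listBlocks, refCount, deleteBlock }) {
  const now = Date.now()
  for await (const { cid, createdAt } of listBlocks()) {
    if ((await refCount(cid)) > 0) continue         // pinned or in MFS
    if (now - createdAt < GRACE_PERIOD_MS) continue // minimum lifetime
    await deleteBlock(cid)
  }
}
```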