Doubts about the relationship between compression, deduplication, and sending/receiving incremental snapshots #185
Most probably you don't need to do that again, but @Zygo may know the details better.
Both computers A and B still have the non-deduplicated first read-only snapshot; you'd need to remove that old snapshot on both. The result should be a deduplicated second snapshot on both computers. This is similar to Q1, so the same caveat applies: I'm not sure whether it is an exact copy with the same shared extents or whether it may deviate from the source original.
Bees may re-compress files, and I'm pretty sure it compares the contents of the uncompressed files, so a mixed environment of compressed and non-compressed files won't worsen your dedup hit rate. There are, however, other factors that play into how well bees can work with compressed extents (because those are fixed at a 128k maximum size). Depending on the source data, deduplication tends to reach a much higher reduction in storage space than compression; if your data deduplicates well, you may not care about compression at all.
Thanks for your reply; it has increased my understanding, although it is not a very definite answer.
Some of the data will be deduplicated. btrfs send can replicate cloned extents; however, to keep kernel usage at sane levels, send has restrictions about how many references it will track and replicate. If the restrictions are exceeded, a simple copy command is emitted instead, and the receiver will have duplicate copies of extents where the sender has deduplicated references to a single extent. The result at the receiver is some point between the maximum and minimum possible deduplication. It is not trivial to estimate what that point will be for a given data set.
The second snapshot is equivalent to the result from the previous question: on B it will be somewhere between not deduplicated and fully deduplicated. The first snapshot is not deduplicated or modified in any way on computer B.
Deduplication works on compressed and uncompressed data interchangeably, i.e. duplicate uncompressed data blocks can be replaced by a reference to a compressed copy. Each extent in btrfs has a separate compression status, so files can contain a mix of compressed and uncompressed extents.
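For illustration, here is a minimal C sketch of the kernel primitive this kind of extent-level dedup goes through, the FIDEDUPERANGE ioctl (file names, offsets, and the 128K length are assumptions for the example, not bees's actual code). The kernel compares the byte contents itself and only shares the extents if they really match, regardless of how either copy is compressed on disk:

```c
/* Minimal sketch: ask the kernel to dedupe a byte range between two
 * files via the FIDEDUPERANGE ioctl. File names and the 128K length
 * are illustrative assumptions. Build: cc -o dedupe dedupe.c */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("a.bin", O_RDONLY);
    int dst = open("b.bin", O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* One request with room for a single destination range. */
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    r->src_offset = 0;
    r->src_length = 128 * 1024;        /* one compressed-extent-sized chunk */
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    /* The kernel verifies the ranges are byte-identical before it
     * replaces the destination blocks with references to the source. */
    if (ioctl(src, FIDEDUPERANGE, r) < 0) { perror("FIDEDUPERANGE"); return 1; }

    if (r->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)r->info[0].bytes_deduped);
    else
        printf("not deduped (status %d)\n", r->info[0].status);

    free(r);
    close(src);
    close(dst);
    return 0;
}
```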
Compressed data requires about 4x more hash table space on various test data sets. You can choose whether to increase the hash table size, or keep the hash table size and accept a lower dedupe hit rate.
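To make that 4x factor concrete with assumed numbers: if a 1 GiB hash table gives you an acceptable hit rate on mostly uncompressed data, the same data stored compressed would want roughly 4 GiB for a comparable hit rate; the alternative is to keep the 1 GiB table and accept the lower hit rate.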
Yes and yes.
Note that the above answers about compression apply to bees. Other dedupers on btrfs handle compressed data very poorly or not at all.
Thank you for your reply! It clears up most of my doubts.
https://btrfs.wiki.kernel.org/index.php/Btrfs_design
Compression is per extent. Btrfs does not compress files like a zip file, as it would then be impossible to seek within the file. That's why compressed extents are at most 128k in size: it allows btrfs to seek near the position you'd like to read, decompress only a tiny part of the file (so there isn't too much decompression overhead), and finally seek to the correct uncompressed position. Also, if btrfs starts out writing a file with compression, it will stop compressing if it finds the compression ratio too low after a certain amount of data written. It additionally has fast statistical heuristics to check whether an extent could reach a useful compression ratio at all, and will skip the compression step otherwise. This way, a file in btrfs is naturally a mixture of compressed and uncompressed extents.
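You can actually observe this mixture from userspace. Below is a minimal C sketch, with an assumed file name (filefrag -v reports the same information), that walks a file's extent map via the FIEMAP ioctl; btrfs marks compressed extents with the FIEMAP_EXTENT_ENCODED flag:

```c
/* Minimal sketch: list a file's extents and whether each one is stored
 * compressed (encoded). The file name is an illustrative assumption.
 * Build: cc -o extmap extmap.c */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("somefile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    enum { MAX_EXTENTS = 64 };
    struct fiemap *fm = calloc(1, sizeof(*fm) +
                               MAX_EXTENTS * sizeof(struct fiemap_extent));
    fm->fm_length = FIEMAP_MAX_OFFSET;     /* map the whole file */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FIEMAP"); return 1; }

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
        struct fiemap_extent *e = &fm->fm_extents[i];
        printf("offset %llu len %llu %s\n",
               (unsigned long long)e->fe_logical,
               (unsigned long long)e->fe_length,
               (e->fe_flags & FIEMAP_EXTENT_ENCODED) ? "compressed"
                                                     : "uncompressed");
    }

    free(fm);
    close(fd);
    return 0;
}
```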
Suppose I have file A, and I enable compression and write it to a btrfs disk.
Bees doesn't look at files at all, it only cares about extents - so it replaces A-extents with A-extents, and it may not even prefer a compressed version over an uncompressed version. The reason why you see file names logged in bees is just that it needs to find a file referencing such an extent to get a file handle to actually read the contents. But it is really only about extents, not files. Actually, bees may sometimes rewrite extent A to a new temporary file, breaking it up into shareable contents, and thus enable orphaned extent parts to be released from the file system. This may reset how and whether the extent is compressed. Also, compressed extents do not work the way you imagine here: a compressed part of a file is made up of 128k extent chunks, which is why Zygo wrote that those will occupy a lot more metadata and hash table space. To conclude: yes, it will combine the A part into shared storage, but it does not guarantee the direction of the operation - it may actually replace file A with the A part of the AB file. And if it decides to rewrite the A part to a temporary file, you may end up with a changed compression mode - depending on whether or how you disabled compression, and on the btrfs kernel heuristics for compression.
Does this rewrite respect the current compression mode set by mount options and btrfs? I also found this in the missing features list:
This makes me wonder whether
@Zygo From what I read, this makes me wonder: is the snapshot received on the remote really deduplicated, or does it still contain duplicated data?
"system calls provided by btrfs" includes the clone range system call, which creates a reflink extent. send will emit these instead of copies where it can, and when it does, the data is not duplicated on the receiver. v2 send format introduces more cases where clones are possible, but v1 send streams still include a lot of clone commands. receive does exactly what send tells it to do. All of the intelligence is on the sending side. send streams are serialized system calls and data packets. You can decode one with Compression method and level is determined by the mount option. If no mount option is provided, the btrfs default method is zlib and level is 3. bees will always compress when it rewrites an extent. Future versions of bees might try to match the original extent's compression method, or have a configurable compression method to use for extent rewrite. |
@Zygo Would adding an option in configuration for
It seems that the current v1 send still doesn't handle deduplication optimally. On my local machine with bees, the two snapshots plus the original, modifiable dir take 27 GB. Using the same compression method and level, but without bees, the two snapshots alone take 31 GB.
send definitely does not handle dedupe optimally. It will be somewhere between no deduplication at all (i.e. the total size of both snapshots) and full deduplication (i.e. the size of the original).
This is a beautiful tool! Thanks to the developers!
I have some questions about the relationship between compression, deduplication, sending/receiving incremental snapshots.
1. I deduplicated on computer A, then created a read-only snapshot and sent it to computer B. Has the snapshot received by computer B been deduplicated? Do I need to deduplicate the data again on computer B?
2. I did not deduplicate on computer A, then created a read-only snapshot and sent it to computer B.
Then I deduplicated on computer A, created a new snapshot, and sent an incremental snapshot to computer B. When computer B receives this new snapshot, is it equivalent to deduplication?
3. When compression is enabled, does deduplication work on compressed or decompressed data?
Does compression weaken deduplication?
Can their combination maximize disk space savings?
Should I use compression when I use deduplication?
I tried to find the answer on btrfs's official wiki, but failed.
Is there a hierarchical abstract diagram of btrfs?
Thank you again for reading this rather long list of questions.