Handle case when file size is too small for sample size #16
Conversation
This fixes the case where hashing may crash when using custom parameters if the sample size is too large for the file.
- Update the spec and code to correctly constrain mode based on file size
- Update tests
- Minor lint fixes
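The mode constraint can be sketched in a few lines (Python used for illustration; the function name and the 4 * sample_size cutoff are assumptions for this sketch, not the spec's exact wording):

```python
def should_sample(size: int, sample_size: int, sample_threshold: int) -> bool:
    """Decide whether sampling is safe for a file of the given size.

    Hypothetical sketch: fall back to hashing the whole file when it is
    below the threshold, or too small to take full samples safely. The
    4 * sample_size cutoff here is an illustrative assumption, not the
    authoritative rule from the updated spec.
    """
    if sample_size < 1:
        return False
    return size >= sample_threshold and size >= 4 * sample_size
```

With something like this, a file that clears the threshold but is still smaller than a few sample sizes is hashed in full instead of crashing mid-sample.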
I might just be missing something or not understanding the algorithm correctly, but I'm not sure I quite understand where …
Oh wait, I think I realize my mistake. I confused myself by thinking the "middle" read would start directly after the "start" read in the case of …
@Hakkin You raise a good point, and funnily enough I noticed … Since this PR effectively defines new behavior anyway, I'm coming around to the idea of …
As an alternative, one could require that …
I want to maintain stability as much as possible for the well-defined part and keep the selection pattern the same. The change now is defining previously undefined behavior, so there is a little more latitude. BTW, this topic will also be revisited when I implement a user-defined number of chunks in the future. For now, I like the cutoff being: …

I've updated this PR with the change, and will get this released as v1.1.0.

Thank you @hiql for the prompt update. Would you mind changing the cutoff and test vectors one more time? 🙏
Issue
Behavior can be unpredictable if the file size is less than ~2x the sample size and sampling is used. This could include hash values that differ between imohash implementations, or even crashing (see #15).
Affected Users
This affects users who:
- use custom parameters where 2*sample_size > sample_threshold
- hash files larger than sample_threshold but smaller than 2*sample_size
Analysis
imohash samples at the beginning, middle, and end of the file. If the sample size is s, the file needs to be at least 2s-1 bytes to sample the middle without hitting EOF. Furthermore, if s is larger than the whole file, then there can be a seek() error when trying to sample s bytes back from the end of the file.

The original spec didn't correctly address this. There is a check described relating the sample size and sample threshold, but it doesn't really make sense. What matters is the actual file size and the sample size. Furthermore, the check was never implemented anyway!
Since the default sample size is 0.125 of the default threshold, the problem condition is never hit when using defaults.
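With the defaults (a 16 KiB sample size against a 128 KiB threshold), a one-line check confirms that any file large enough to be sampled is well above the minimum:

```python
SAMPLE_SIZE = 16 * 1024         # default sample size
SAMPLE_THRESHOLD = 128 * 1024   # default threshold; sampling starts here

# Any sampled file is at least SAMPLE_THRESHOLD bytes, which is far
# above the 2s - 1 bytes needed for the middle read, so the defaults
# never hit the problem condition.
assert SAMPLE_THRESHOLD >= 2 * SAMPLE_SIZE - 1
```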
Fix
The spec and code have been updated to check for sample size relative to file size and choose the correct mode.
If you're using custom parameters in the range described above and saving the hash, the hash post-upgrade will differ for files within the affected size range. In practice, this is a rare case: most uses are for synchronization, and the hashes are ephemeral.

Technically, this is a breaking change. However, given the narrow nature of this issue, I'm merely going to v1.1.0 with it and not a full v2.
Fixes #15