Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std::hash::Hash documentation should suggest that the hash data should be prefix-free #89429

Closed
kpreid opened this issue Oct 1, 2021 · 20 comments · Fixed by #89438
Closed

std::hash::Hash documentation should suggest that the hash data should be prefix-free #89429

kpreid opened this issue Oct 1, 2021 · 20 comments · Fixed by #89438
Assignees
Labels
A-docs Area: Documentation for any part of the project, including the compiler, standard library, and tools C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@kpreid
Copy link
Contributor

kpreid commented Oct 1, 2021

As discussed at Why does str.hash(…) pass an extra byte to the Hasher?: the code of impl Hash for &str specifically passes an extra 0xFF byte to the Hasher, so that values like ("ab", "c") and ("a", "bc") hash differently (added in 6066118).

This is a subtle property of hashing and should probably be mentioned in the documentation for Hashparticularly as the documentation for Hasher already says “you cannot assume, for example, that a write_u32 call is equivalent to four calls of write_u8” which could be sloppily interpreted as an expectation that a good Hasher implementation will handle “quoting” of sequential calls itself.

If an implementor of Hash fails to have this property, it could compromise the hash-DoS resistance that Rust tries to offer by default.

@rustbot labels: +A-docs +T-libs-api +C-enhancement

@fosskers
Copy link

fosskers commented Oct 1, 2021

Should hand-implementors of Hash instances keep this in mind as well? For instance, for the following struct:

struct User {
    name: String,
    age: u32,
    cool: bool,
}

rust-analyzer can auto-generate the following implementation (and I assume #[derive(Hash)] does the same):

impl Hash for User {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.name.hash(state);
        self.age.hash(state);
        self.cool.hash(state);
    }
}

There are no magic bytes here, as in str. Is that okay? Do we run the risk of this hashing to the same value as (String, u32, bool)? (Actually I will test this, one moment...)

@fosskers
Copy link

fosskers commented Oct 1, 2021

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
struct User {
    name: String,
    age: u32,
    cool: bool,
}

struct Boozer {
    name: String,
    age: u32,
    cool: bool,
}

impl Hash for Boozer {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.name.hash(state);
        self.age.hash(state);
        self.cool.hash(state);
    }
}

fn hash<H: Hash>(h: H) {
    let mut hasher = DefaultHasher::new();
    h.hash(&mut hasher);
    println!("{}", hasher.finish());
}

fn main() {
    let user = User {
        name: "Jack".to_string(),
        age: 8,
        cool: true,
    };
    let boozer = Boozer {
        name: "Jack".to_string(),
        age: 8,
        cool: true,
    };
    let tuple = ("Jack".to_string(), 8, true);

    hash(user);
    hash(boozer);
    hash(tuple);
}

The results are:

#+RESULTS:
: 10970795962411125193
: 10970795962411125193
: 10970795962411125193

Is this expected?

@tczajka
Copy link

tczajka commented Oct 1, 2021

Is that okay? Do we run the risk of this hashing to the same value as (String, u32, bool)? (Actually I will test this, one moment...)

Different types hashing to the same value are not a problem. What is a problem is different (unequal) values of the same type hashing to the same value (for all Hashers).

@pierwill
Copy link
Member

pierwill commented Oct 1, 2021

@rustbot claim

@cuviper
Copy link
Member

cuviper commented Oct 1, 2021

Different types hashing to the same value are not a problem. What is a problem is different (unequal) values of the same type hashing to the same value (for all Hashers).

Collisions are inevitable, and it shouldn't matter between Hashers for the same reason it doesn't for different types.

I do agree that types should try to avoid collisions, as long as we make this kind of thing a suggestion of best practice, not stated as a hard requirement of the trait.

@tczajka
Copy link

tczajka commented Oct 1, 2021

Collisions are inevitable

Collisions in Hasher are inevitable. Collisions in Hash are not inevitable.

One benefit of not creating collisions in Hash is that it then is possible to implement a randomized Hasher with mathematically bounded small collision probability for any two unequal inputs (using universal hashing). If there is a deterministic collision in Hash, then it becomes impossible, because you have 100% collision probability for that pair of inputs.

@cuviper
Copy link
Member

cuviper commented Oct 1, 2021

OK, that's a compelling distinction between the responsibilities of Hash and Hasher.

Is it always possible to do this perfectly on the Hash side? I suppose it must be -- if there's a feature that makes values different from an Eq perspective, that is an input that should also be used for hashing.

@kpreid
Copy link
Contributor Author

kpreid commented Oct 1, 2021

[Trying again to set labels since I typoed the first one:]
@rustbot label: +A-docs +T-libs-api +C-enhancement

@rustbot rustbot added A-docs Area: Documentation for any part of the project, including the compiler, standard library, and tools C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Oct 1, 2021
@tczajka
Copy link

tczajka commented Oct 1, 2021

Is it always possible to do this perfectly on the Hash side? I suppose it must be -- if there's a distinction makes them different from an Eq perspective, that is an input that should also be used for hashing.

I am not sure it's always possible, in theory you could imagine some contrived Eq algorithm that makes it computationally expensive to Hash in a way fully consistent with it, but I can't think of a practical example.

I think it would be fine to say that if can't come up with a proper Hash algorithm that has this property, you shouldn't implement Hash (perhaps you'd be better served by BTreeMap in that case).

@pierwill
Copy link
Member

pierwill commented Oct 1, 2021

For the docs, would it make sense to discuss the case of impl Hash for &str in a new paragraph after this paragraph:

/// This trait makes no assumptions about how the various `write_*` methods are
/// defined and implementations of [`Hash`] should not assume that they work one
/// way or another. You cannot assume, for example, that a [`write_u32`] call is
/// equivalent to four calls of [`write_u8`].
?

The paragraph could start with something like:

For example, the code for impl Hash for &str passes an extra 0xFF byte to the Hasher, so that values like ("ab", "c") and ("a", "bc") hash differently.

@danielhenrymantilla
Copy link
Contributor

Let's then note that BTreeMap Hash is thus currently "bugged" in that regard (c.f. my snippet in the URLO thread).

@pierwill
Copy link
Member

pierwill commented Oct 1, 2021

I gave this a try! I might not have understood correctly, however...

@scottmcm
Copy link
Member

scottmcm commented Oct 1, 2021

FWIW, being prefix-free can also be unfortunate sometimes. For example, (T, T) is more efficient to hash than [T; 2] because the latter hashes the length to match the hash of the corresponding [T] -- and slices do that so it's prefix-free.

I don't know that there's a great answer for this, though...

@cuviper
Copy link
Member

cuviper commented Oct 1, 2021

Let's then note that BTreeMap Hash is thus currently "bugged" in that regard (c.f. my snippet in the URLO thread).

See #89443.

@tczajka
Copy link

tczajka commented Oct 1, 2021

For example, (T, T) is more efficient to hash than [T; 2] because the latter hashes the length to match the hash of the corresponding [T] -- and slices do that so it's prefix-free.

Is there a real reason [T; 2] and [T] of length 2 have to hash to the same thing? I don't think it's that easy to accidentally coerce them to a wrong thing when hashing (surely HashMap doesn't do that).

In any case, slices are already prefix-free (as long as T is) (and I think that's good, otherwise you'd have a lot of collisions on different short slices), so this issue doesn't change that.

@cuviper
Copy link
Member

cuviper commented Oct 1, 2021

Is there a real reason [T; 2] and [T] of length 2 have to hash to the same thing?

Yes, Borrow requires it. You can have a HashMap with array keys and then do lookups with a slice.

@m-ou-se
Copy link
Member

m-ou-se commented Oct 3, 2021

Nominated this for @rust-lang/libs-api discussion.

@scottmcm
Copy link
Member

scottmcm commented Oct 4, 2021

@tczajka I added a comment to the impl in #86140 after I forget why it was that way too :)

@Amanieu
Copy link
Member

Amanieu commented Oct 6, 2021

One possibility would be to leave the choice of whether to add a prefix to the Hasher via some form of hint: performance-oriented hashers would opt to skip the prefix while security-oriented ones would opt to keep it.

This could be done with an optional method on the Hasher trait:

trait Hasher {
    // Option 1
    fn should_add_prefix(&self) -> bool { false }
    
    // Option 2
    fn write_prefix(&mut self, prefix: usize) { self.write_usize(prefix); }
}

@tczajka
Copy link

tczajka commented Oct 6, 2021

performance-oriented hashers would opt to skip the prefix

I dispute the idea that making the Hash encoding non-prefix-free is a performance optimization, for the same reason an empty hash implementation that does nothing would not really be a performance optimization.

Potential huge cost of resulting hash collisions (for certain kinds of data) is likely to greatly overwhelm any such (tiny) savings from lazy hashing.

I think of this as a bug because it defeats the assumptions of mathematical analysis of the performance properties of hash tables with certain hashers.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Oct 10, 2021
docs: `std::hash::Hash` should ensure prefix-free data

Attempt to synthesize the discussion in rust-lang#89429 into a suggestion regarding `Hash` implementations (not a hard requirement).

Closes rust-lang#89429.
@bors bors closed this as completed in f531b81 Oct 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-docs Area: Documentation for any part of the project, including the compiler, standard library, and tools C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants