`std::hash::Hash` documentation should suggest that the hash data should be prefix-free #89429

kpreid · 2021-10-01T15:01:16Z

As discussed in the forum thread "Why does str.hash(…) pass an extra byte to the Hasher?", the code of impl Hash for &str deliberately passes an extra 0xFF byte to the Hasher, so that values like ("ab", "c") and ("a", "bc") hash differently (added in 6066118).

This is a subtle property of hashing and should probably be mentioned in the documentation for Hash — particularly as the documentation for Hasher already says “you cannot assume, for example, that a write_u32 call is equivalent to four calls of write_u8” which could be sloppily interpreted as an expectation that a good Hasher implementation will handle “quoting” of sequential calls itself.

If an implementor of Hash fails to have this property, it could compromise the hash-DoS resistance that Rust tries to offer by default.
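To make the extra byte visible, here is a minimal sketch using a hypothetical `Recorder` hasher (not part of std, introduced only for illustration) that collects every byte it is fed. The exact byte stream is an implementation detail of the standard library and may change, so the sketch only checks that the two tuples produce different streams.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical helper (not a std type): records every byte written to it.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0 // the hash value itself is not used in this sketch
    }
}

/// Returns the byte stream that `value` feeds to a hasher.
fn recorded<T: Hash>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash(&mut recorder);
    recorder.bytes
}

fn main() {
    let left = recorded(&("ab", "c"));
    let right = recorded(&("a", "bc"));
    println!("{:x?}", left);
    println!("{:x?}", right);
    // Without a delimiter both streams would be just the bytes of "abc".
    assert_ne!(left, right);
}
```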

@rustbot labels: +A-docs +T-libs-api +C-enhancement


fosskers · 2021-10-01T15:49:33Z

Should hand-implementors of Hash instances keep this in mind as well? For instance, for the following struct:

struct User {
    name: String,
    age: u32,
    cool: bool,
}

rust-analyzer can auto-generate the following implementation (and I assume #[derive(Hash)] does the same):

impl Hash for User {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.name.hash(state);
        self.age.hash(state);
        self.cool.hash(state);
    }
}

There are no magic bytes here, as in str. Is that okay? Do we run the risk of this hashing to the same value as (String, u32, bool)? (Actually I will test this, one moment...)

fosskers · 2021-10-01T16:05:07Z

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
struct User {
    name: String,
    age: u32,
    cool: bool,
}

struct Boozer {
    name: String,
    age: u32,
    cool: bool,
}

impl Hash for Boozer {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.name.hash(state);
        self.age.hash(state);
        self.cool.hash(state);
    }
}

fn hash<H: Hash>(h: H) {
    let mut hasher = DefaultHasher::new();
    h.hash(&mut hasher);
    println!("{}", hasher.finish());
}

fn main() {
    let user = User {
        name: "Jack".to_string(),
        age: 8,
        cool: true,
    };
    let boozer = Boozer {
        name: "Jack".to_string(),
        age: 8,
        cool: true,
    };
    let tuple = ("Jack".to_string(), 8, true);

    hash(user);
    hash(boozer);
    hash(tuple);
}

The results are:

#+RESULTS:
: 10970795962411125193
: 10970795962411125193
: 10970795962411125193

Is this expected?

tczajka · 2021-10-01T16:11:56Z

> Is that okay? Do we run the risk of this hashing to the same value as (String, u32, bool)? (Actually I will test this, one moment...)

Different types hashing to the same value are not a problem. What is a problem is different (unequal) values of the same type hashing to the same value (for all Hashers).

pierwill · 2021-10-01T16:24:54Z

@rustbot claim

cuviper · 2021-10-01T16:27:42Z

> Different types hashing to the same value are not a problem. What is a problem is different (unequal) values of the same type hashing to the same value (for all Hashers).

Collisions are inevitable, and it shouldn't matter between Hashers for the same reason it doesn't for different types.

I do agree that types should try to avoid collisions, as long as we make this kind of thing a suggestion of best practice, not stated as a hard requirement of the trait.

tczajka · 2021-10-01T16:32:58Z

> Collisions are inevitable

Collisions in Hasher are inevitable. Collisions in Hash are not inevitable.

One benefit of not creating collisions in Hash is that it then is possible to implement a randomized Hasher with mathematically bounded small collision probability for any two unequal inputs (using universal hashing). If there is a deterministic collision in Hash, then it becomes impossible, because you have 100% collision probability for that pair of inputs.
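A rough sketch of such a deterministic collision, assuming two made-up types `Naive` and `Delimited` (neither is from this thread). `Naive` writes its fields back to back with no delimiter, so any hasher that treats consecutive `write` calls as one byte stream, as std's `DefaultHasher` appears to do at the time of writing, cannot tell the two values apart. The derived impl on `Delimited` goes through `Vec<u8>`'s `Hash`, which also feeds the length.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical type whose hand-written `Hash` is NOT prefix-free.
struct Naive {
    first: Vec<u8>,
    second: Vec<u8>,
}

impl Hash for Naive {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Raw bytes only: nothing marks where `first` ends.
        state.write(&self.first);
        state.write(&self.second);
    }
}

/// Same fields, but the derived impl delegates to `Vec<u8>`'s `Hash`,
/// which includes the length of each field.
#[derive(Hash)]
struct Delimited {
    first: Vec<u8>,
    second: Vec<u8>,
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let a = Naive { first: b"ab".to_vec(), second: b"c".to_vec() };
    let b = Naive { first: b"a".to_vec(), second: b"bc".to_vec() };
    // Both feed the bytes "abc", so the hashes match.
    assert_eq!(hash_of(&a), hash_of(&b));

    let c = Delimited { first: b"ab".to_vec(), second: b"c".to_vec() };
    let d = Delimited { first: b"a".to_vec(), second: b"bc".to_vec() };
    // The inputs to the hasher differ, so a collision is merely improbable.
    assert_ne!(hash_of(&c), hash_of(&d));
}
```

Randomizing the hasher's keys would not help `Naive`: the hasher receives identical input for both values, so the collision survives any choice of key.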

cuviper · 2021-10-01T16:40:27Z

OK, that's a compelling distinction between the responsibilities of Hash and Hasher.

Is it always possible to do this perfectly on the Hash side? I suppose it must be -- if there's a feature that makes values different from an Eq perspective, that is an input that should also be used for hashing.

kpreid · 2021-10-01T16:41:26Z

[Trying again to set labels since I typoed the first one:]
@rustbot label: +A-docs +T-libs-api +C-enhancement

tczajka · 2021-10-01T16:44:59Z

> Is it always possible to do this perfectly on the Hash side? I suppose it must be -- if there's a distinction that makes them different from an Eq perspective, that is an input that should also be used for hashing.

I am not sure it's always possible. In theory you could imagine some contrived Eq algorithm that makes it computationally expensive to Hash in a way fully consistent with it, but I can't think of a practical example.

I think it would be fine to say that if you can't come up with a proper Hash algorithm that has this property, you shouldn't implement Hash (perhaps you'd be better served by BTreeMap in that case).

pierwill · 2021-10-01T16:47:34Z

For the docs, would it make sense to discuss the case of impl Hash for &str in a new paragraph after this paragraph:

rust/library/core/src/hash/mod.rs
Lines 246 to 249 in ed93759

/// This trait makes no assumptions about how the various `write_*` methods are
/// defined and implementations of [`Hash`] should not assume that they work one
/// way or another. You cannot assume, for example, that a [`write_u32`] call is
/// equivalent to four calls of [`write_u8`].

?

The paragraph could start with something like:

> For example, the code for impl Hash for &str passes an extra 0xFF byte to the Hasher, so that values like ("ab", "c") and ("a", "bc") hash differently.

danielhenrymantilla · 2021-10-01T16:55:31Z

Let's then note that the Hash impl for BTreeMap is thus currently "bugged" in that regard (cf. my snippet in the URLO thread).

pierwill · 2021-10-01T17:47:01Z

I gave this a try! I might not have understood correctly, however...

scottmcm · 2021-10-01T19:02:04Z

FWIW, being prefix-free can also be unfortunate sometimes. For example, (T, T) is more efficient to hash than [T; 2] because the latter hashes the length to match the hash of the corresponding [T] -- and slices do that so it's prefix-free.
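A small sketch of that difference, using a hypothetical byte-recording `Recorder` hasher (not a std type) to compare what a tuple, an array, and a slice feed to the hasher. The exact bytes are a standard-library implementation detail; the sketch assumes only the behavior described above, namely that the array matches the slice and carries extra length data that the tuple does not.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical helper (not a std type): records every byte written to it.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0 // the hash value itself is not used in this sketch
    }
}

/// Returns the byte stream that `value` feeds to a hasher.
/// `?Sized` allows passing an unsized slice such as `[u8]`.
fn recorded<T: Hash + ?Sized>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash(&mut recorder);
    recorder.bytes
}

fn main() {
    let tuple = recorded(&(1u8, 2u8));
    let array = recorded(&[1u8, 2u8]);
    let slice = recorded(&[1u8, 2u8][..]);

    println!("tuple: {} bytes", tuple.len());
    println!("array: {} bytes", array.len());
    println!("slice: {} bytes", slice.len());

    // The array hashes exactly like the corresponding slice...
    assert_eq!(array, slice);
    // ...and therefore feeds more data than the tuple, which has no length prefix.
    assert!(array.len() > tuple.len());
}
```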

I don't know that there's a great answer for this, though...

cuviper · 2021-10-01T19:30:40Z

> Let's then note that BTreeMap Hash is thus currently "bugged" in that regard (c.f. my snippet in the URLO thread).

See #89443.

tczajka · 2021-10-01T19:38:30Z

> For example, (T, T) is more efficient to hash than [T; 2] because the latter hashes the length to match the hash of the corresponding [T] -- and slices do that so it's prefix-free.

Is there a real reason [T; 2] and [T] of length 2 have to hash to the same thing? I don't think it's that easy to accidentally coerce them to a wrong thing when hashing (surely HashMap doesn't do that).

In any case, slices are already prefix-free (as long as T is) (and I think that's good, otherwise you'd have a lot of collisions on different short slices), so this issue doesn't change that.

cuviper · 2021-10-01T19:39:29Z

> Is there a real reason [T; 2] and [T] of length 2 have to hash to the same thing?

Yes, Borrow requires it. You can have a HashMap with array keys and then do lookups with a slice.
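A minimal sketch of that lookup pattern (the `build_map` helper is made up for illustration). Because `[T; N]` implements `Borrow<[T]>`, `HashMap::get` accepts a slice key, and the lookup can only succeed if the slice hashes the same way as the stored array.

```rust
use std::collections::HashMap;

/// Hypothetical helper: a map keyed by fixed-size arrays.
fn build_map() -> HashMap<[u8; 2], &'static str> {
    let mut map = HashMap::new();
    map.insert([1u8, 2u8], "found");
    map
}

fn main() {
    let map = build_map();

    // Look up with a slice: `[u8; 2]: Borrow<[u8]>` makes this type-check.
    let key: &[u8] = &[1, 2];
    assert_eq!(map.get(key), Some(&"found"));

    // A slice of a different length is a valid query too; it just never matches.
    let longer: &[u8] = &[1, 2, 3];
    assert_eq!(map.get(longer), None);
}
```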

m-ou-se · 2021-10-03T14:14:04Z

Nominated this for @rust-lang/libs-api discussion.

scottmcm · 2021-10-04T03:26:45Z

@tczajka I added a comment to the impl in #86140 after I forgot why it was that way too :)

Amanieu · 2021-10-06T11:23:45Z

One possibility would be to leave the choice of whether to add a prefix to the Hasher via some form of hint: performance-oriented hashers would opt to skip the prefix while security-oriented ones would opt to keep it.

This could be done with an optional method on the Hasher trait:

trait Hasher {
    // Option 1
    fn should_add_prefix(&self) -> bool { false }
    
    // Option 2
    fn write_prefix(&mut self, prefix: usize) { self.write_usize(prefix); }
}
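To illustrate how Option 2 might play out, here is a sketch under stated assumptions: `PrefixHasher`, `NoPrefix`, and `hash_pair` are hypothetical names invented for this example, and the method is placed on an extension trait rather than on `Hasher` itself so that the code compiles on stable Rust.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical extension trait modeled on "Option 2"; not a std API.
trait PrefixHasher: Hasher {
    fn write_prefix(&mut self, prefix: usize) {
        self.write_usize(prefix);
    }
}

/// A security-oriented hasher keeps the default and hashes the prefix.
impl PrefixHasher for DefaultHasher {}

/// Hypothetical performance-oriented hasher that opts out of the prefix.
struct NoPrefix(DefaultHasher);

impl Hasher for NoPrefix {
    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes);
    }

    fn finish(&self) -> u64 {
        self.0.finish()
    }
}

impl PrefixHasher for NoPrefix {
    fn write_prefix(&mut self, _prefix: usize) {
        // Deliberately skipped: the stream is no longer prefix-free.
    }
}

/// How a collection-like `Hash` impl might feed two byte strings.
fn hash_pair<H: PrefixHasher>(mut hasher: H, first: &[u8], second: &[u8]) -> u64 {
    hasher.write_prefix(first.len());
    hasher.write(first);
    hasher.write_prefix(second.len());
    hasher.write(second);
    hasher.finish()
}

fn main() {
    // With prefixes, ("ab", "c") and ("a", "bc") feed different streams.
    let kept_1 = hash_pair(DefaultHasher::new(), b"ab", b"c");
    let kept_2 = hash_pair(DefaultHasher::new(), b"a", b"bc");
    assert_ne!(kept_1, kept_2);

    // Without prefixes, both feed the bytes "abc" and collide.
    let skipped_1 = hash_pair(NoPrefix(DefaultHasher::new()), b"ab", b"c");
    let skipped_2 = hash_pair(NoPrefix(DefaultHasher::new()), b"a", b"bc");
    assert_eq!(skipped_1, skipped_2);
}
```

The trade-off is visible in `main`: the hasher that skips the prefix does less work per value but collides on every pair of inputs that differ only in where the boundary falls.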

tczajka · 2021-10-06T12:31:54Z

> performance-oriented hashers would opt to skip the prefix

I dispute the idea that making the Hash encoding non-prefix-free is a performance optimization, for the same reason an empty hash implementation that does nothing would not really be a performance optimization.

The potentially huge cost of the resulting hash collisions (for certain kinds of data) is likely to far outweigh any (tiny) savings from lazy hashing.

I think of this as a bug because it defeats the assumptions of mathematical analysis of the performance properties of hash tables with certain hashers.

docs: `std::hash::Hash` should ensure prefix-free data

Attempt to synthesize the discussion in rust-lang#89429 into a suggestion regarding `Hash` implementations (not a hard requirement).

Closes rust-lang#89429.

rustbot assigned pierwill Oct 1, 2021

pierwill mentioned this issue Oct 1, 2021

docs: std::hash::Hash should ensure prefix-free data #89438

Merged

m-ou-se added the I-nominated label Oct 3, 2021

bors closed this as completed in f531b81 Oct 10, 2021
