Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose vectorized hash table that work with Arrays #6996

Open
rluvaton opened this issue Jan 19, 2025 · 3 comments
Open

Expose vectorized hash table that work with Arrays #6996

rluvaton opened this issue Jan 19, 2025 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@rluvaton
Copy link
Contributor

I just read the Photon paper from 2022 and saw their vectorized implementation for hash table, I also noticed that someone opened an issue in DataFusion apache/datafusion#7095 for implementing it for group aggregate

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to have HashSet/HashMap that would support all Hash table functionality but with Arrays as input.

Problem:
DataFusion has array_agg with distinct support, if you look at the implementation it just keep adding to HashSet
https://github.com/apache/datafusion/blob/6c9355d5be8b6045865fed67cb6d028b2dfc2e06/datafusion/functions-aggregate/src/array_agg.rs#L268-L281

this works but can be improved with computing all the hashes first, and then do probing in a tight loop

Describe the solution you'd like
It would be helpful to use there and in other places a HashSet/HashMap that can

  1. Insert all values from an array
  2. Check all values in array exists and return a BooleanArray for the result
  3. Get all the values that match each key in the array

Describe alternatives you've considered
Implement it everywhere that need HashMap/HashSet or create external crate

Additional context
The way I see it there will be couple of implementation

  1. Primitive/boolean
  2. Bytes
  3. Generic that will use arrow-row

I'm willing to create a PR for that. I see it as using internally the hashbrown raw API to implement that

@rluvaton rluvaton added the enhancement Any new improvement worthy of a entry in the changelog label Jan 19, 2025
@rluvaton rluvaton changed the title exposing vectorized hash table that work with Arrays Expose vectorized hash table that work with Arrays Jan 19, 2025
@tustvold
Copy link
Contributor

tustvold commented Jan 19, 2025

Did you mean to file this in the datafusion repository? Hashtable implementations will necessarily be very tightly coupled with how a given query engine chooses to handle dataflow, and the operation being performed, I'm not sure a generic approach makes a whole lot of sense, beyond what is already provided by hashbrown::RawTable.

In fact the conclusion of the linked DF ticket appears to be that DF already implements this...

@alamb
Copy link
Contributor

alamb commented Jan 22, 2025

I think it is a very good observation that there are several structures in DataFusion that basically look like hash sets / hash maps on arrays.

let hash_table = HashMap::new(key_array.data_type(), value_array.data_type());
// returns a BooleanArray that 
let inserted = hash_table.insert(key_array, value

And there are similar things for Set. I am thinking specifically about code like

https://github.com/apache/datafusion/blob/274e5356ceb4c559ab4105478e75817a302d2f13/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs#L44-L51

and

https://github.com/apache/datafusion/blob/274e5356ceb4c559ab4105478e75817a302d2f13/datafusion/functions-aggregate-common/src/aggregate/count_distinct/bytes.rs#L30-L46

In my opinon such a primitive makes a lot of sense to consider moving upstream in arrow. The open question in my mind is exactly what the API would look like

Perhaps to begin with, we could begin in DataFusion refactoring the various maps / sets into an ArrowSet / ArrowMap. Once we have figured out what the API is then would be an excellent time to consider proposing upstreaming it into arrow.

@tustvold
Copy link
Contributor

tustvold commented Jan 22, 2025

The open question in my mind is exactly what the API would look like

This is the key thing IMO, I suspect either you will end up with something maximally flexible like RawTable but which is not very ergonomic, or something that's higher-level but very coupled to the operation being performed, e.g. count distinct, but which isn't applicable to aggregates in general, or can't handle multiple columns, etc...

I agree that seeing if there is a way to reduce duplication within DF would be a good first step to identify if there is a common API that can be extracted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants