-
Notifications
You must be signed in to change notification settings - Fork 881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose vectorized hash table that work with Array
s
#6996
Comments
Array
sArray
s
Did you mean to file this in the datafusion repository? Hashtable implementations will necessarily be very tightly coupled with how a given query engine chooses to handle dataflow, and the operation being performed, I'm not sure a generic approach makes a whole lot of sense, beyond what is already provided by hashbrown::RawTable. In fact the conclusion of the linked DF ticket appears to be that DF already implements this... |
I think it is a very good observation that there are several structures in DataFusion that basically look like hash sets / hash maps on arrays. let hash_table = HashMap::new(key_array.data_type(), value_array.data_type());
// returns a BooleanArray that
let inserted = hash_table.insert(key_array, value And there are similar things for Set. I am thinking specifically about code like and In my opinon such a primitive makes a lot of sense to consider moving upstream in arrow. The open question in my mind is exactly what the API would look like Perhaps to begin with, we could begin in DataFusion refactoring the various maps / sets into an |
This is the key thing IMO, I suspect either you will end up with something maximally flexible like I agree that seeing if there is a way to reduce duplication within DF would be a good first step to identify if there is a common API that can be extracted. |
I just read the Photon paper from 2022 and saw their vectorized implementation for hash table, I also noticed that someone opened an issue in DataFusion apache/datafusion#7095 for implementing it for group aggregate
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to have
HashSet
/HashMap
that would support all Hash table functionality but with Arrays as input.Problem:
DataFusion has
array_agg
with distinct support, if you look at the implementation it just keep adding toHashSet
https://github.com/apache/datafusion/blob/6c9355d5be8b6045865fed67cb6d028b2dfc2e06/datafusion/functions-aggregate/src/array_agg.rs#L268-L281
this works but can be improved with computing all the hashes first, and then do probing in a tight loop
Describe the solution you'd like
It would be helpful to use there and in other places a HashSet/HashMap that can
Describe alternatives you've considered
Implement it everywhere that need HashMap/HashSet or create external crate
Additional context
The way I see it there will be couple of implementation
arrow-row
I'm willing to create a PR for that. I see it as using internally the hashbrown raw API to implement that
The text was updated successfully, but these errors were encountered: