-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
build(deps): update Arrow/Parquet to 52.0
, object-store to 0.10
#10765
Conversation
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Okay, the one last thing is to update Cargo.toml after arrow |
Epic! |
Signed-off-by: Ruihang Xia <[email protected]>
@waynexia It looks like you need to run |
common::collect(merged_aggregate.execute(0, task_ctx.clone())?).await?; | ||
// enlarge memory limit in spill mode | ||
let task_ctx = if spill { | ||
new_spill_ctx(2, 2600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be configurable? Also could you add a comment explaining what is happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya could you also review?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change has two stages to my understanding. The first partial aggr stage has a smaller limit, to meet the expectation that the partial plan has early emit.
datafusion/datafusion/physical-plan/src/aggregates/mod.rs
Lines 1504 to 1516 in 6846513
let expected = if spill { | |
vec![ | |
"+---+---------------+-------------+", | |
"| a | AVG(b)[count] | AVG(b)[sum] |", | |
"+---+---------------+-------------+", | |
"| 2 | 1 | 1.0 |", | |
"| 2 | 1 | 1.0 |", | |
"| 3 | 1 | 2.0 |", | |
"| 3 | 2 | 5.0 |", | |
"| 4 | 3 | 11.0 |", | |
"+---+---------------+-------------+", | |
] | |
} else { |
And the new lines enlarge the limit after that "early emit" check. Otherwise the merge aggr would fall because of insufficient memory. (at line 1554)
let result = common::collect(merged_aggregate.execute(0, task_ctx)?).await?;
The root cause of this change is somehow the memory requirement becomes larger. From my investigation, the biggest change compared to the previous is from the RawTable
in GroupValuesPrimitive
-- it needs about 1000 bytes more:
datafusion/datafusion/physical-plan/src/aggregates/group_values/primitive.rs
Lines 81 to 95 in 6846513
pub struct GroupValuesPrimitive<T: ArrowPrimitiveType> { | |
/// The data type of the output array | |
data_type: DataType, | |
/// Stores the group index based on the hash of its value | |
/// | |
/// We don't store the hashes as hashing fixed width primitives | |
/// is fast enough for this not to benefit performance | |
map: RawTable<usize>, | |
/// The group index of the null value if any | |
null_group: Option<usize>, | |
/// The values for each group index | |
values: Vec<T::Native>, | |
/// The random state used to generate hashes | |
random_state: RandomState, | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't realized that these literal values were just in tests ... thanks for the explanation
@@ -2032,7 +2037,7 @@ mod tests { | |||
spill: bool, | |||
) -> Result<()> { | |||
let task_ctx = if spill { | |||
new_spill_ctx(2, 3200) | |||
new_spill_ctx(2, 4200) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, seems like we need to make something configurable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding, do you mean the change in 3a33755 🤔 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I would like to understand the aggregate spilling changes more.
Thanks @waynexia
I merged up from main and updated the lock file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @waynexia -- I agree with @andygrove 's concern about the changes to spilling -- I think we should address them before merging but otherwise this looks really nice to me
🙏
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Thanks for reviewing! |
@waynexia I was inspecting the dependency tree for delta-rs and it seems the dependency didn't get bumped for object-store in
|
Sorry for causing this! Opened #10848 to fix it. |
All good, just wanted to bring it to your attention before 39 formally drops :) |
FYI @andygrove -- maybe this means we should make another RC for 39.0.0 🤔 We could also potentially make a 39.1.0 release too with this dependency upgrade fix #10848 |
…pache#10765) * fix compile on default feature config Signed-off-by: Ruihang Xia <[email protected]> * fix test of common, functions, optimizer and physical-expr Signed-off-by: Ruihang Xia <[email protected]> * fix other tests Signed-off-by: Ruihang Xia <[email protected]> * fix one last test Signed-off-by: Ruihang Xia <[email protected]> * fix clippy warnings Signed-off-by: Ruihang Xia <[email protected]> * fix datafusion-cli Signed-off-by: Ruihang Xia <[email protected]> * switch to git deps Signed-off-by: Ruihang Xia <[email protected]> * regen proto file Signed-off-by: Ruihang Xia <[email protected]> * fix pyo3 feature Signed-off-by: Ruihang Xia <[email protected]> * fix slt Signed-off-by: Ruihang Xia <[email protected]> * fix symmetric hash join cases Signed-off-by: Ruihang Xia <[email protected]> * update integration result Signed-off-by: Ruihang Xia <[email protected]> * fix up spill test Signed-off-by: Ruihang Xia <[email protected]> * shift to the released packages Signed-off-by: Ruihang Xia <[email protected]> * Update cargo.lock * Update datafusion/optimizer/src/analyzer/type_coercion.rs Co-authored-by: Andrew Lamb <[email protected]> * update document Signed-off-by: Ruihang Xia <[email protected]> * move memory limit to parameter pos Signed-off-by: Ruihang Xia <[email protected]> --------- Signed-off-by: Ruihang Xia <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #.
Rationale for this change
51.0.0
, tonic to0.11
#961352.0.0
arrow-rs#5688What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?