-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[test] add fuzz test for topk #7772
Conversation
f63c3fc
to
4144b44
Compare
@alamb Please review it. |
Thanks |
Thank you @Tangruilin -- I plan to review this tomorrow. I look forward to it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Tangruilin
TODO: This PR is not completed. I need some suggestions that if i need to modify fn batches_to_vec
I am not quite sure what you mean here. I think if you can make a TopKScenario
type structure, it maybe easier to figure out if you need to modify batches_to_vec
@@ -138,6 +162,44 @@ impl SortTest { | |||
self | |||
} | |||
|
|||
async fn run_with_params( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps naming this run_with_limit
would make it easier to understand what is happening
} | ||
|
||
#[tokio::test] | ||
async fn test_sort_topk_i32() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you think about encapsulating the limit data and expected value calculation in a structure?
So this test might look like
let size: usize = ...; // pick a random size
let scenario = TopKScenario::new()
// tell the scenario to sort by one column
.with_sort_column(["i32_column"])
// specify a limit of 10 rows
.with_limit(10);
// stagger the batches in the scenario
scenario.stagger()
let collected = SortTest::new()
// call Scenario::batches to get the input batches
.with_input(scenario.batches());
// run the test
.run_with_limit("t", scenario.sort_cols(), scenario.limit()).await;
// The scenario handles calculting expected output (as it knows the sort column and limit)
let expected = scenario.expected()
let actual = batches_to_vec(&collected);
assert_eq!(actual, &expected);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good suggestion! I'll do it tomorrow and push it
Mark as draft to signify this PR is not waiting on feedback anymore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @Tangruilin
use std::sync::Arc; | ||
use test_utils::{batches_to_vec, partitions_to_sorted_vec}; | ||
use test_utils::{batches_to_vec, partitions_to_sorted_vec, stagger_batch}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const KB: u64 = 1 << 10; |
for (batch_size, should_spill) in [(5, false), (20000, true), (1000000, true)] { | ||
SortTest::new() | ||
.with_int32_batches(batch_size) | ||
.with_pool_size(10240) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.with_pool_size(10240) | |
.with_pool_size(10 * KB) |
for (batch_size, should_spill) in [(5, false), (20000, false), (1000000, true)] { | ||
SortTest::new() | ||
.with_int32_batches(batch_size) | ||
.with_pool_size(102400) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.with_pool_size(102400) | |
.with_pool_size(100 * KB) |
I run ci/scripts/rust_clippy.sh on my mac but get the result is not some, I'm confused @alamb |
You may have to do:
|
except that. I found that the rust should be stable —— run |
I think it is implicit in the Testing setup: part:
I agree this could be clearer in the documentation. Perhaps you can help here |
I have tried some ways, but there is not a prefect solution. I will update the PR tonight |
Signed-off-by: reilly <[email protected]>
4144b44
to
1920bfe
Compare
Thanks @Tangruilin -- I plan to review this more carefully tomorrow |
Don't forget this~~~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @Tangruilin and @Weijun-H -- this very nicely done. I am sorry for my delay in reviewing. I think this PR adds additional coverage and is structured to make future improvements straghtforward.
🙏 great job
datafusion_expr::col(topk_scenario.col_name).sort(true, true) | ||
]) | ||
.unwrap() | ||
.limit(0, Some(topk_scenario.limit)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified the plan here has the expected TopK
element:
| physical_plan | GlobalLimitExec: skip=0, fetch=10 |
| | SortExec: TopK(fetch=10), expr=[x@0 ASC] |
| | MemoryExec: partitions=1, partition_sizes=[38] |
BTW I am working on an extesion of this test to support multiple columns |
Which issue does this PR close?
Closes #7749 .
Rationale for this change
Add fuzz test for topk
What changes are included in this PR?
TODO: This PR is not completed. I need some suggestions that if i need to modify
fn batches_to_vec
Are these changes tested?
this PR is a test
Are there any user-facing changes?
no