Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(7181): provide slicing of CursorValues #7912

Closed
wants to merge 6 commits into from

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Oct 24, 2023

Which issue does this PR close?

Part of #7181 .

Rationale for this change

In the cascaded merge tree, partially sorted record batches are yielded per merge node.
Therefore, we need the ability to yield a partial of these representative values (a.k.a. the CursorValues).

What changes are included in this PR?

Slicing the CursorValues.
Add tests to demonstrate use with Cursor::new(sliced_cursor_values).
Have test coverage include different data types.

Are these changes tested?

Yes, tests are added.

Are there any user-facing changes?

No. CursorValues are currently an internal construct.

@wiedld wiedld force-pushed the 7181/slice-cursor-values branch from cdcdf6f to 6cd1b82 Compare October 24, 2023 02:12
@wiedld wiedld marked this pull request as ready for review October 24, 2023 03:53
@alamb
Copy link
Contributor

alamb commented Oct 25, 2023

I plan to review this tomorrow morning

@wiedld
Copy link
Contributor Author

wiedld commented Oct 26, 2023

For any questions on the big picture design, referred to the diagrams here.

Note that^^ is a drafted followup PR.
I've tried to incorporate some of the naming/wordage changes suggested, but other names have not changed yet (since I'm not sure how the design diagrams may impact the recommended wordage).

@tustvold
Copy link
Contributor

I wonder if you've thought about storing the slice in the BatchCursor, as opposed to pushing it down to this level. That way the logic would only exist once, as opposed to once for every CursorValues? https://github.com/apache/arrow-datafusion/pull/7842/files#diff-1eb407bf7bf6bcc6115e66c6a0197bba129c90328b9d6cd00a5e1eb5e74daa85R103

It could be a stupid idea though...

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- I think this looks really nice. It is both well documented as well as very thoroughly tested.

I have some suggestions on improvements to the documentation and tests that I think we should consider prior to merging, but I don't think they are strictly necessary which is why I am approving now.

Thanks again

}

#[test]
fn test_slice_primitive_nulls_first() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for also testing with nulls_first / nulls_last? I didn't see that the semantics of slice differ based on the that setting and thus I don't think these tests add much/any new coverage

I suggest either making a loop in test_slice_primitive that uses the different SortOptions ensuring they produce the same result or just removing these two tests

// hello == hello
assert_eq!(a.cmp(&b), Ordering::Equal);

// advance A. slice B full length.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for these comments 👨‍🍳 👌

@tustvold
Copy link
Contributor

We should probably confirm this doesn't regress the current sort performance, e.g. by introducing an additional memory indirection when comparing cursors, but otherwise if @alamb is happy lets just move forwards with this

@wiedld
Copy link
Contributor Author

wiedld commented Oct 28, 2023

Run with the application of this slicing code (a single composable merge node) branch here.

gcp c3d-standard-8-lssd debian-11

Screenshot 2023-10-27 at 8 14 50 PM

There are further confirmation steps, as well as hypotheses, as to why this^^ is found. I'll post updates after.

@alamb alamb marked this pull request as draft November 1, 2023 19:01
@alamb
Copy link
Contributor

alamb commented Nov 1, 2023

I think we are waiting on some more measurement confirmation that this PR doesn't regress performance. Marking it as draft as we do that so we don't accidentally merge this

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 25, 2024
@wiedld
Copy link
Contributor Author

wiedld commented Apr 29, 2024

I think we should close this @alamb , since it's not a priority at the moment. And whenever we circle back, there will be a very large diff (due to changes). Ok to close?

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Apr 30, 2024
@alamb
Copy link
Contributor

alamb commented Apr 30, 2024

Always good with me to close PRs (we can always reopen them later if needed)

Thanks @wiedld

@alamb alamb closed this Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants