feat(7181): Encapsulate cursor handling within the SortOrderBuilder. #7842

wiedld · 2023-10-16T23:07:41Z

Part of #7181

Rationale for this change

This is the 1st half of the next change, as we start building I/O streams for the merge node.

The goal of this PR is a separation of concerns:

- merge node logic is the loser tree (SortPreservingMergeStream)
- the sort order builder handles the creation of sort orders (and in the future, any batch slicing & offset changes)

Goal of the next PR:

- have the first I/O streams for the merge nodes, so it's composable into a tree.
- an example commit is here, although it doesn't yet contain the CursorValues update, nor the slicing.

What changes are included in this PR?

- Rename BatchBuilder => SortOrderBuilder.
- A new submodule sorts/batches.
  - Handles anything specific to record batches.
  - Currently includes the BatchCursor, which will (in the next PR) contain the unique BatchId and be yielded per merge node.
  - In the next PR, will include the BatchTracker, which collects the record batches and assigns a unique BatchId, such that the cascading streams only pass around cursors.
- Make a skeleton SortPreservingCascadeStream. Does nothing yet.
- Metrics:
  - poll metrics are only collected around the operator poll (the cascade tree root).
  - the compute metric is still collected in the loser tree.
- Move cursor into SortOrderBuilder.
  - In the future, yielding of sort orders will include cursor slicing.
  - The merge node should not need to care about cursor slicing. Therefore the cursor lives within the sort order builder.

Are these changes tested?

Passing sort tests.
Let me know if any additional tests should be added.

Are there any user-facing changes?

No.

wiedld · 2023-10-16T23:15:02Z

datafusion/physical-plan/src/sorts/builder.rs

+        match slot.as_mut() {
+            Some(c) => {
+                if c.cursor.is_finished() {
+                    return false;


This is slightly different from the code removed from the merge node, which sets *slot = None on finish.

Here we do not drop the batch_cursor. Instead, we return false prior to advancement, so the same boolean is returned the next time advance_cursor() is called.

Note that this means the lifecycle of a self.cursors[stream_idx] slot is:

Option == None. For that stream_idx, no ongoing cursor exists yet.

Some(cursor) after push_batch().

several push_row()

merge/loser tree node checks that the cursor is finished (!cursor_in_progress())

polls for next CursorValues

push_batch()

push_batch saves the completed BatchCursor (to sorted_batches) and adds the new BatchCursor.
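The slot lifecycle above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SketchBuilder`, the simplified `Cursor` (just an offset and a length), and the `num_rows` parameter are hypothetical stand-ins, and only the early `return false` on a finished cursor is taken from the quoted diff.

```rust
/// Hypothetical, simplified cursor: a position within a batch of `len` rows.
#[derive(Debug)]
struct Cursor {
    offset: usize,
    len: usize,
}

impl Cursor {
    fn is_finished(&self) -> bool {
        self.offset == self.len
    }

    fn advance(&mut self) {
        self.offset += 1;
    }
}

/// Hypothetical stand-in for `BatchCursor`.
#[derive(Debug)]
struct BatchCursor {
    batch_idx: usize,
    cursor: Cursor,
}

/// Hypothetical stand-in for `SortOrderBuilder`: one optional cursor per stream.
#[derive(Debug)]
struct SketchBuilder {
    cursors: Vec<Option<BatchCursor>>,
    sorted_batches: Vec<BatchCursor>,
    next_batch_idx: usize,
}

impl SketchBuilder {
    fn new(stream_count: usize) -> Self {
        Self {
            cursors: (0..stream_count).map(|_| None).collect(),
            sorted_batches: Vec::new(),
            next_batch_idx: 0,
        }
    }

    /// Installs a fresh cursor for `stream_idx`; a previous (completed)
    /// cursor in that slot is moved to `sorted_batches`.
    fn push_batch(&mut self, stream_idx: usize, num_rows: usize) {
        let new_cursor = BatchCursor {
            batch_idx: self.next_batch_idx,
            cursor: Cursor { offset: 0, len: num_rows },
        };
        self.next_batch_idx += 1;
        if let Some(completed) = self.cursors[stream_idx].replace(new_cursor) {
            self.sorted_batches.push(completed);
        }
    }

    /// Returns false when there is no cursor or the cursor is finished.
    /// A finished cursor stays in its slot (it is not reset to `None`).
    fn advance_cursor(&mut self, stream_idx: usize) -> bool {
        match self.cursors[stream_idx].as_mut() {
            Some(c) => {
                if c.cursor.is_finished() {
                    return false;
                }
                c.cursor.advance();
                true
            }
            None => false,
        }
    }
}
```

In this sketch a finished cursor keeps returning false until the next push_batch() replaces it, which mirrors the behavior described above.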

After the next PR, we will then see:

SortOrderBuilder::build_batch() will change to SortOrderBuilder::yield_sort_order()

on yield, it will do:

fully yielded BatchCursors will have a new (at the start) Cursor with the full CursorValues

fully retained BatchCursors will have no change. (remain in ongoing self.cursors[stream_idx])

partial yielded BatchCursor:

will slice the underlying CursorValues and have new cursors in both (sliced) parts
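A rough sketch of the three yield cases, under heavy assumptions: `Values` stands in for CursorValues as a plain Vec<i32>, `split_on_yield` is a hypothetical helper name, and the real slicing in the follow-up PR may look quite different.

```rust
/// Hypothetical stand-in for `CursorValues`: a sorted run of keys.
#[derive(Debug, Clone, PartialEq)]
struct Values(Vec<i32>);

impl Values {
    fn len(&self) -> usize {
        self.0.len()
    }

    /// Copies `length` values starting at `offset` (a real implementation
    /// would likely slice without copying).
    fn slice(&self, offset: usize, length: usize) -> Values {
        Values(self.0[offset..offset + length].to_vec())
    }
}

/// Hypothetical cursor: `offset` counts the rows already consumed.
#[derive(Debug, PartialEq)]
struct Cursor {
    offset: usize,
    values: Values,
}

impl Cursor {
    fn new(values: Values) -> Self {
        Self { offset: 0, values }
    }
}

#[derive(Debug, PartialEq)]
struct BatchCursor {
    batch_idx: usize,
    cursor: Cursor,
}

/// Splits a batch cursor into (yielded, retained) parts on yield.
fn split_on_yield(bc: BatchCursor) -> (Option<BatchCursor>, Option<BatchCursor>) {
    let consumed = bc.cursor.offset;
    let len = bc.cursor.values.len();
    if consumed == 0 {
        // Fully retained: no change.
        (None, Some(bc))
    } else if consumed == len {
        // Fully yielded: new cursor at the start, over the full values.
        let BatchCursor { batch_idx, cursor } = bc;
        let yielded = BatchCursor { batch_idx, cursor: Cursor::new(cursor.values) };
        (Some(yielded), None)
    } else {
        // Partially yielded: slice the values, new cursors in both parts.
        let yielded = bc.cursor.values.slice(0, consumed);
        let retained = bc.cursor.values.slice(consumed, len - consumed);
        (
            Some(BatchCursor { batch_idx: bc.batch_idx, cursor: Cursor::new(yielded) }),
            Some(BatchCursor { batch_idx: bc.batch_idx, cursor: Cursor::new(retained) }),
        )
    }
}
```

Both parts keep the same batch_idx here, so a downstream consumer could still trace each slice back to its original batch.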

wiedld · 2023-10-20T19:10:22Z

datafusion/physical-plan/src/sorts/builder.rs

@@ -42,15 +37,15 @@ pub struct BatchBuilder {
    reservation: MemoryReservation,

    /// The current [`BatchCursor`] for each stream
-    cursors: Vec<BatchCursor>,
+    cursors: Vec<Option<BatchCursor<C>>>,


This is indexed per stream. We don't always have an ongoing cursor for each stream, hence the Option.

It seems like this structure now does some of the same things as SortPreservingMergeStream: https://github.com/apache/arrow-datafusion/blob/37d6bf08c948418fe6c72d072d988c2875d81e02/datafusion/physical-plan/src/sorts/merge.rs#L210-L222

Is there any way to avoid a second copy?

There is no second copy. (That one is deleted.)

All of the cursor management/ownership is moved into the SortOrderBuilder, and removed from the SortPreservingMergeStream. The division of concerns is:

SortPreservingMergeStream == only cares about merging (via loser tree)

SortOrderBuilder == holds the cursor (being advanced), and the sort_order (being built).

This clearly demonstrates why I need to add more docs to the BatchCursor. 😅 Doing so.

wiedld · 2023-10-20T19:32:35Z

datafusion/physical-plan/src/sorts/batches.rs

+pub(crate) struct BatchCursor<C: CursorValues> {
+    /// The index into SortOrderBuilder::batches
+    /// TODO: this will become a BatchId, for record batch collected (and not passed across streams)
+    pub batch_idx: usize,


Note: the BatchId is required for the batch tracking. Therefore, we have to yield something (beyond just the CursorValues) which contains this tracking id.

I decided to have the same abstraction (a.k.a. BatchCursor) be used for both batches currently being sorted (in a merge/loser tree), and the batches being yielded to the next merge/loser tree node (in the cascade tree). But we could revisit this design.
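To illustrate the batch tracking idea, here is a minimal sketch of a tracker that hands out unique ids. Everything in it is an assumption: the BatchTracker is only planned for a later PR, `Batch` is a stand-in for a RecordBatch, and the method names `add_batch`, `get`, and `remove` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical unique id for a collected record batch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct BatchId(u64);

/// Stand-in for a `RecordBatch` (a single column of values).
#[derive(Debug, Clone, PartialEq)]
struct Batch(Vec<i32>);

/// Hypothetical tracker: owns the batches so that streams only need to
/// pass around cursors carrying a `BatchId`.
#[derive(Debug, Default)]
struct BatchTracker {
    next_id: u64,
    batches: HashMap<BatchId, Batch>,
}

impl BatchTracker {
    /// Stores the batch and returns the unique id assigned to it.
    fn add_batch(&mut self, batch: Batch) -> BatchId {
        let id = BatchId(self.next_id);
        self.next_id += 1;
        self.batches.insert(id, batch);
        id
    }

    /// Looks up a batch by id, e.g. when building the final output.
    fn get(&self, id: BatchId) -> Option<&Batch> {
        self.batches.get(&id)
    }

    /// Releases a batch once no cursor refers to it any more.
    fn remove(&mut self, id: BatchId) -> Option<Batch> {
        self.batches.remove(&id)
    }
}
```

With something like this, a BatchCursor would only need to carry the BatchId, and the record batches themselves would not have to travel between merge nodes.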

alamb

I spent a good while studying this code this morning. Thank you @wiedld

The code on main I think looks something like:

Original RecordBatches with input rows are stored by the BatchBuilder
The Cursors (not BatchCursor!) are stored in SortPreservingMergeStream
The SortPreservingMergeStream manages the cursors and the BatchBuilder manages creating output rows.

This PR somewhat combines these concerns: SortOrderBuilder now has Cursors (but they are also still in the SortPreservingMergeStream).

I think I understand the need to keep the RecordBatch alongside the cursors as they pass through the merge tree. However, keeping two (possibly more) structures with the same RecordBatch synchronized during execution seems like it will be fraught with potential hard-to-debug errors.

Rather than duplicate the logic, would it be possible to transfer the RecordBatch ownership entirely to the Cursors? The BatchBuilder would still need to track the output being built (batch_id, row_idx) but the batches could come from the cursors themselves 🤔

Musings

As I was reading this PR I was reflecting on why it has taken so long to review. One reason I think is that I don't have a high level understanding of the new concepts that are being introduced (e.g. a BatchCursor) and how they interact in the new design. Thus while reviewing this PR I am both trying to understand the code, but also reverse engineer the larger design.

For example, it took a while to understand the role a BatchCursor is supposed to play (I can read the code and see what it is doing but not why it is doing so). To your credit there is some version of it here https://github.com/apache/arrow-datafusion/pull/7379/files#diff-f97f96eb27e6344b0d9de91d7eceb98f2cc4f2099843673270d3698ebbbad4abR43-R97 but I am still struggling to understand it for some reason (maybe the diagram could include the new structures and how they are related 🤔 I am just brainstorming)

alamb · 2023-10-21T10:37:53Z

datafusion/physical-plan/src/sorts/batches.rs

+    pub batch_idx: usize,
+
+    /// The row index within the given batch
+    pub row_idx: usize,


Given the Cursor already has a row offset, why is there another row_idx in BatchCursor? Is it the same or does it potentially point at a different offset?

This is existing tech debt, as the row_idx already exists (on main) in order to handle cursor advancement beyond row_idx.

This property will be removed in future PRs. I added a code comment to reflect this^^.

alamb · 2023-10-21T10:39:43Z

datafusion/physical-plan/src/sorts/batches.rs

+use super::cursor::{Cursor, CursorValues};
+
+#[derive(Debug)]
+pub(crate) struct BatchCursor<C: CursorValues> {


Perhaps we can add a doc comment here explaining what this structure is for. It seems like the core function is to hold the entire original RecordBatch that the input rows came from (rather than just the columns that are part of the sort key).

It holds partial record batches, once the slicing occurs in the future.

Per your excellent suggestion, I've added documentation to make apparent the goals of each structure. Hopefully, this helps for design discussions.

alamb · 2023-10-21T11:00:46Z

datafusion/physical-plan/src/sorts/builder.rs

@@ -42,15 +37,15 @@ pub struct BatchBuilder {
    reservation: MemoryReservation,

    /// The current [`BatchCursor`] for each stream
-    cursors: Vec<BatchCursor>,
+    cursors: Vec<Option<BatchCursor<C>>>,


It seems like this structure now does some of the same things as SortPreservingMergeStream: https://github.com/apache/arrow-datafusion/blob/37d6bf08c948418fe6c72d072d988c2875d81e02/datafusion/physical-plan/src/sorts/merge.rs#L210-L222

Is there any way to avoid a second copy?

wiedld · 2023-10-23T17:50:45Z

datafusion/physical-plan/src/sorts/builder.rs

+    ///
+    #[allow(dead_code)]
+    pub fn yield_sort_order(&mut self) -> Result<Option<YieldedSortOrder<C>>> {
+        unimplemented!("to implement in future PR");


By having the SortOrderBuilder take full ownership of the cursors, we can have all batch partials (and awareness of partials) be handled in the SortOrderBuilder. This yield_sort_order() will have an implementation similar to this.

The loser/merge tree will only interface with the cursor through the SortOrderBuilder (e.g. push_batch(), push_row(), advance_cursor()). Please let me know if there is a better design option. 😄

tustvold · 2023-10-23T17:53:46Z

datafusion/physical-plan/src/sorts/builder.rs


 /// Provides an API to incrementally build a [`RecordBatch`] from partitioned [`RecordBatch`]
 #[derive(Debug)]
-pub struct BatchBuilder {
+pub struct SortOrderBuilder<C: CursorValues> {


Is the eventual plan to remove batches from this? It seems a little peculiar as a construction as it stands

Yes, exactly. I just pushed a commit with documentation -- to show where this was going.

I was trying to break up the change into 2 parts (see PR description). That decision seems to have created confusion. 😅

I wonder if this should instead be called CursorInterleave or something, as I believe that is the operation it is actually performing?

That decision seems to have created confusion

Yeah perhaps you could roll that into this PR, whilst it will be larger, I think it will be easier to review

tustvold · 2023-10-23T17:58:01Z

datafusion/physical-plan/src/sorts/stream.rs

@@ -48,6 +48,10 @@ pub trait PartitionedStream: std::fmt::Debug + Send {
    ) -> Poll<Option<Self::Output>>;
 }

+/// A fallible [`PartitionedStream`] of [`Cursor`](super::cursor::Cursor) and [`RecordBatch`]
+pub(crate) type CursorStream<C> =
+    Box<dyn PartitionedStream<Output = Result<(C, RecordBatch)>>>;


Is the idea this will eventually just be Box<dyn PartitionedStream<Output = Result<C>>> with the RecordBatch handled separately?

tustvold · 2023-10-23T18:00:41Z

datafusion/physical-plan/src/sorts/batches.rs

+///              ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+///             │
+///             ▼
+///        BatchCursors


I'm not sure of a better name, but to me BatchCursor would suggest this construction owns a RecordBatch and is a cursor over it, which isn't the case. Perhaps we might also be able to work the slice aspect in there, as I think that is what they are.

Perhaps we could rename BatchCursors to SortOrder or MergedCursors and have that as the abstraction? Or is per-slice access required somehow?

wiedld · 2023-10-23T18:10:54Z

By trying to break up the latest chunk into 2 PRs, I instead added to the confusion.
Going to combine into a single, larger PR to make clear where this is going.

alamb · 2023-10-23T20:45:15Z

Thank you @wiedld -- sorry for the conflicting advice. I do feel like we are making progress though

wiedld · 2023-10-24T21:31:35Z

Closing, as will eventually be replaced by this larger PR + lots of ascii diagrams.

wiedld added 4 commits October 16, 2023 10:19

refactor(7181): rename BatchBuilder as SortOrderBuilder

512558f

refactor(7181): move code around

cc48ef6

* CursorStream is a crate pub type
* BatchCursor is now in a batch submod
* Add the cascade merge skeleton, which is currently a multithread-safe wrapper around a single merge node

refactor(7181): metrics for polling of stream operator should only occur at cascade root

80afa5b

feat(7181): move Cursor into SortOrderBuilder, and leave a the merge node as only the loser tree

f6396d6

* This sets up the SortOrderBuilder as owning (and slicing) the cursors.

chore(7181): rename to cascade root

01c2bad

wiedld marked this pull request as ready for review October 17, 2023 01:16

wiedld added 2 commits October 20, 2023 11:13

Merge branch 'main' into 7181/sort-order-builder

01fa590

fix(7181): incorporate the separation of Cursor from CursorValues

f09fc78

wiedld force-pushed the 7181/sort-order-builder branch from 50233de to 96f72d9 Compare October 23, 2023 17:16

chore(7181): add documentation for the BatchCursor, and the unimplemented yield_sort_order()

222c12a

wiedld force-pushed the 7181/sort-order-builder branch from 96f72d9 to 222c12a Compare October 23, 2023 17:35

wiedld marked this pull request as draft October 23, 2023 18:10

wiedld closed this Oct 24, 2023
