Add Push-Based CSV Decoder #3604
Conversation
```rust
/// Clears and then fills the buffers on this [`RecordReader`]
/// returning the number of records read
fn fill_buf(&mut self, to_read: usize) -> Result<usize, ArrowError> {
```
Effectively all this PR does is lift the state from fill_buf's stack frame onto the struct
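The refactor described above — moving decode state that previously lived in `fill_buf`'s stack frame onto the struct so a call can resume where the last one stopped — can be sketched minimally. This is a hedged, self-contained toy (all names hypothetical, standing in for the real CSV decoder):

```rust
// Hypothetical sketch: the partially-read record, previously a local
// variable inside fill_buf, is lifted onto the struct so decoding can
// pick up across calls.
struct RecordReader {
    // Bytes of an incomplete record carried over between calls.
    partial: Vec<u8>,
    // Total complete records seen so far.
    records: usize,
}

impl RecordReader {
    fn new() -> Self {
        Self { partial: Vec::new(), records: 0 }
    }

    /// Consume `buf`, counting newline-terminated records and stashing
    /// any trailing partial record on `self` for the next call.
    /// Returns the number of complete records read from this buffer.
    fn decode(&mut self, buf: &[u8]) -> usize {
        let mut read = 0;
        for &b in buf {
            if b == b'\n' {
                self.records += 1;
                read += 1;
                self.partial.clear();
            } else {
                self.partial.push(b);
            }
        }
        read
    }
}

fn main() {
    let mut r = RecordReader::new();
    // A record split across two pushes still counts exactly once.
    let a = r.decode(b"foo,bar\nbaz,");
    let b = r.decode(b"qux\n");
    assert_eq!(a + b, 2);
    println!("{}", a + b); // prints 2
}
```

Because the carry-over lives on the struct rather than a stack frame, the caller is free to supply input in chunks of any size.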
```rust
let mut skipped = 0;
while to_skip > skipped {
    let read = self.fill_buf(to_skip.min(1024))?;
    if read == 0 {
```
Returning an error here was a quick workaround for an infinite loop, added in #3470
This PR handles this properly and simply returns no [`RecordBatch`] if the offset exceeds the length of the file - I think this makes for a better UX
I went through the code carefully -- thank you @tustvold
My only question about this PR is whether there is sufficient test coverage that feeds data in small / quasi-random buffer sizes to cover all the decoding corner cases (i.e. picking up decoding state from where it left off)?
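The kind of coverage being asked about can be sketched as follows: decode the same input split at every possible byte boundary and assert the result matches a one-shot decode. This is a hedged, self-contained illustration with a toy decoder standing in for the real CSV decoder:

```rust
// Toy stand-in for the real decoder: counts newline-terminated records.
#[derive(Default)]
struct ToyDecoder {
    records: usize,
}

impl ToyDecoder {
    fn decode(&mut self, buf: &[u8]) {
        self.records += buf.iter().filter(|&&b| b == b'\n').count();
    }
}

fn main() {
    let input = b"a,b\nc,d\ne,f\n";

    // Decode in one shot as the reference result.
    let mut whole = ToyDecoder::default();
    whole.decode(input);

    // Re-decode with the input split at every boundary; any state the
    // decoder carries between calls must survive every split point.
    for split in 0..=input.len() {
        let mut d = ToyDecoder::default();
        d.decode(&input[..split]);
        d.decode(&input[split..]);
        assert_eq!(d.records, whole.records, "split at {split}");
    }
    println!("all {} splits agree", input.len() + 1);
}
```

Exhaustively splitting a small fixture at every boundary is cheap and exercises exactly the "resume mid-record" paths a quasi-random buffer size might miss.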
```rust
///
/// See [`Reader`] for a higher-level interface for use with [`Read`]
///
/// The push-based interface facilitates integration with sources that yield arbitrarily
```
👍
```rust
/// last call to [`Self::flush`], or `buf` is exhausted. Any remaining bytes
/// should be included in the next call to [`Self::decode`]
///
/// There is no requirement that `buf` contains a whole number of records, facilitating
```
👍
```rust
/// Clears the current contents of the decoder
pub fn clear(&mut self) {
    // This does not reset current_field to allow clearing part way through a record
    self.offsets_len = 1;
```
I don't understand what the use case for `clear` is here -- how would it clear part way through a record and then pick back up?
Benchmark runs are scheduled for baseline = 9728c67 and contender = d9c2681. d9c2681 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #.
Rationale for this change
Inspired by the RawDecoder interface added in https://github.com/apache/arrow-rs/pull/3479/files, I wanted to add a similar interface to the CSV reader, which is what this PR does.
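The shape of the caller-side loop such a push-based decoder enables can be sketched as below. This is a hedged, self-contained toy (the real arrow-rs API differs in names and return types; the point is the driving loop, which works with any byte source and any chunk size):

```rust
use std::io::Read;

// Hypothetical stand-in for a push-based decoder: decode() consumes
// bytes and reports how many it used; flush() emits complete records.
#[derive(Default)]
struct PushDecoder {
    buffered: Vec<u8>,
}

impl PushDecoder {
    /// Consume as many bytes of `buf` as possible, returning the count.
    fn decode(&mut self, buf: &[u8]) -> usize {
        self.buffered.extend_from_slice(buf);
        buf.len()
    }

    /// Emit everything decoded so far as complete lines.
    fn flush(&mut self) -> Vec<String> {
        let out = self
            .buffered
            .split(|&b| b == b'\n')
            .filter(|l| !l.is_empty())
            .map(|l| String::from_utf8_lossy(l).into_owned())
            .collect();
        self.buffered.clear();
        out
    }
}

fn main() -> std::io::Result<()> {
    // Any byte source works; a cursor stands in for a socket or file.
    let mut source = std::io::Cursor::new(b"x,1\ny,2\n".to_vec());
    let mut decoder = PushDecoder::default();
    let mut chunk = [0u8; 3]; // deliberately tiny, arbitrary chunk size
    loop {
        let read = source.read(&mut chunk)?;
        if read == 0 {
            break; // source exhausted
        }
        // Feed the chunk until the decoder has consumed all of it.
        let mut consumed = 0;
        while consumed < read {
            consumed += decoder.decode(&chunk[consumed..read]);
        }
    }
    let rows = decoder.flush();
    assert_eq!(rows, ["x,1", "y,2"]);
    println!("{rows:?}");
    Ok(())
}
```

Because the decoder never pulls from a `Read` itself, the same loop works for async streams, network buffers, or memory-mapped regions that hand the caller bytes in whatever sizes they arrive.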
What changes are included in this PR?
Are there any user-facing changes?