Add Push-Based CSV Decoder #3604
Conversation
```rust
/// Clears and then fills the buffers on this [`RecordReader`]
/// returning the number of records read
fn fill_buf(&mut self, to_read: usize) -> Result<usize, ArrowError> {
```
Effectively all this PR does is lift the state from fill_buf's stack frame onto the struct
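The refactor described above — moving decode state that previously lived in `fill_buf`'s stack frame onto the struct so a call can resume where the last one stopped — can be sketched minimally. This is a hedged, self-contained toy (all names hypothetical, standing in for the real CSV decoder):

```rust
// Hypothetical sketch: the partially-read record, previously a local
// variable inside fill_buf, is lifted onto the struct so decoding can
// pick up across calls.
struct RecordReader {
    // Bytes of an incomplete record carried over between calls.
    partial: Vec<u8>,
    // Total complete records seen so far.
    records: usize,
}

impl RecordReader {
    fn new() -> Self {
        Self { partial: Vec::new(), records: 0 }
    }

    /// Consume `buf`, counting newline-terminated records and stashing
    /// any trailing partial record on `self` for the next call.
    /// Returns the number of complete records read from this buffer.
    fn decode(&mut self, buf: &[u8]) -> usize {
        let mut read = 0;
        for &b in buf {
            if b == b'\n' {
                self.records += 1;
                read += 1;
                self.partial.clear();
            } else {
                self.partial.push(b);
            }
        }
        read
    }
}

fn main() {
    let mut r = RecordReader::new();
    // A record split across two pushes still counts exactly once.
    let a = r.decode(b"foo,bar\nbaz,");
    let b = r.decode(b"qux\n");
    assert_eq!(a + b, 2);
    println!("{}", a + b); // prints 2
}
```

Because the carry-over lives on the struct rather than a stack frame, the caller is free to supply input in chunks of any size.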
```rust
let mut skipped = 0;
while to_skip > skipped {
    let read = self.fill_buf(to_skip.min(1024))?;
    if read == 0 {
```
Returning an error here was a quick workaround for an infinite loop, added in #3470
This PR handles this properly and simply returns no [`RecordBatch`] if the offset exceeds the length of the file - I think this makes for a better UX
I went through the code carefully -- thank you @tustvold
My only question about this PR is whether there is sufficient test coverage that feeds data in small / quasi-random buffer sizes to cover all the decoding corner cases (i.e. picking up decoding state from where it left off)?
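The kind of coverage being asked about can be sketched as follows: decode the same input split at every possible byte boundary and assert the result matches a one-shot decode. This is a hedged, self-contained illustration with a toy decoder standing in for the real CSV decoder:

```rust
// Toy stand-in for the real decoder: counts newline-terminated records.
#[derive(Default)]
struct ToyDecoder {
    records: usize,
}

impl ToyDecoder {
    fn decode(&mut self, buf: &[u8]) {
        self.records += buf.iter().filter(|&&b| b == b'\n').count();
    }
}

fn main() {
    let input = b"a,b\nc,d\ne,f\n";

    // Decode in one shot as the reference result.
    let mut whole = ToyDecoder::default();
    whole.decode(input);

    // Re-decode with the input split at every boundary; any state the
    // decoder carries between calls must survive every split point.
    for split in 0..=input.len() {
        let mut d = ToyDecoder::default();
        d.decode(&input[..split]);
        d.decode(&input[split..]);
        assert_eq!(d.records, whole.records, "split at {split}");
    }
    println!("all {} splits agree", input.len() + 1);
}
```

Exhaustively splitting a small fixture at every boundary is cheap and exercises exactly the "resume mid-record" paths a quasi-random buffer size might miss.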
```rust
///
/// See [`Reader`] for a higher-level interface for use with [`Read`]
///
/// The push-based interface facilitates integration with sources that yield arbitrarily
```
👍
```rust
/// last call to [`Self::flush`], or `buf` is exhausted. Any remaining bytes
/// should be included in the next call to [`Self::decode`]
///
/// There is no requirement that `buf` contains a whole number of records, facilitating
```
👍
```rust
/// Clears the current contents of the decoder
pub fn clear(&mut self) {
    // This does not reset current_field to allow clearing part way through a record
    self.offsets_len = 1;
```
I don't understand what the use case for `clear` is here -- how would it clear part way through a record and then pick back up?
Benchmark runs are scheduled for baseline = 9728c67 and contender = d9c2681. d9c2681 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #.
Rationale for this change
Inspired by the RawDecoder interface added in https://github.com/apache/arrow-rs/pull/3479/files, I wanted to add a similar interface to the CSV reader, which is what this PR does.
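The shape of the caller-side loop such a push-based decoder enables can be sketched as below. This is a hedged, self-contained toy (the real arrow-rs API differs in names and return types; the point is the driving loop, which works with any byte source and any chunk size):

```rust
use std::io::Read;

// Hypothetical stand-in for a push-based decoder: decode() consumes
// bytes and reports how many it used; flush() emits complete records.
#[derive(Default)]
struct PushDecoder {
    buffered: Vec<u8>,
}

impl PushDecoder {
    /// Consume as many bytes of `buf` as possible, returning the count.
    fn decode(&mut self, buf: &[u8]) -> usize {
        self.buffered.extend_from_slice(buf);
        buf.len()
    }

    /// Emit everything decoded so far as complete lines.
    fn flush(&mut self) -> Vec<String> {
        let out = self
            .buffered
            .split(|&b| b == b'\n')
            .filter(|l| !l.is_empty())
            .map(|l| String::from_utf8_lossy(l).into_owned())
            .collect();
        self.buffered.clear();
        out
    }
}

fn main() -> std::io::Result<()> {
    // Any byte source works; a cursor stands in for a socket or file.
    let mut source = std::io::Cursor::new(b"x,1\ny,2\n".to_vec());
    let mut decoder = PushDecoder::default();
    let mut chunk = [0u8; 3]; // deliberately tiny, arbitrary chunk size
    loop {
        let read = source.read(&mut chunk)?;
        if read == 0 {
            break; // source exhausted
        }
        // Feed the chunk until the decoder has consumed all of it.
        let mut consumed = 0;
        while consumed < read {
            consumed += decoder.decode(&chunk[consumed..read]);
        }
    }
    let rows = decoder.flush();
    assert_eq!(rows, ["x,1", "y,2"]);
    println!("{rows:?}");
    Ok(())
}
```

Because the decoder never pulls from a `Read` itself, the same loop works for async streams, network buffers, or memory-mapped regions that hand the caller bytes in whatever sizes they arrive.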
What changes are included in this PR?
Are there any user-facing changes?