Use `BufReader` for LocalFileReader to revert performance regression in parquet reading #1366

Dandandan · 2021-11-26T16:10:37Z

Which issue does this PR close?

Rationale for this change

Parquet reading was slow. This recovers the performance regression in the TPC-H benchmark.

There is still a slowdown in query 10 and other queries, but it is unrelated to Parquet reading (see #1367). Performance on that query still improves from roughly 10s to 7s.

What changes are included in this PR?

Use the `BufRead` trait instead of `Read`.

Are there any user-facing changes?

alamb

Thank you @Dandandan -- I am not sure about the need for the BufRead trait, but otherwise looks good to me.

Very nice 🕵️ work

alamb · 2021-11-27T11:40:20Z

datafusion/src/datasource/object_store/local.rs

@@ -18,7 +18,7 @@
 //! Object store that represents the Local File System.

 use std::fs::{self, File, Metadata};
-use std::io::{Read, Seek, SeekFrom};
+use std::io::{BufRead, BufReader, Read, Seek, SeekFrom};


TIL BufRead trait

alamb · 2021-11-27T11:41:18Z

datafusion/src/datasource/object_store/local.rs

        // A new file descriptor is opened for each chunk reader.
        // This okay because chunks are usually fairly large.
        let mut file = File::open(&self.file.path)?;
        file.seek(SeekFrom::Start(start))?;
-        Ok(Box::new(file.take(length as u64)))
+
+        let file = BufReader::new(file.take(length as u64));


It seems like this is the actual fix, right? Is the change to require the BufRead trait needed?

Not sure. Rerunning some benchmarks now without the trait.

Yes - looks like BufRead wasn't needed 🎉
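
To illustrate why the trait bound could be dropped, here is a minimal sketch, not the actual DataFusion code. `chunk_reader` is a hypothetical helper and its signature is an assumption. Since `BufReader<R>` implements `Read` whenever `R` does, the buffered reader can be returned behind a plain `Read` trait object:

```rust
use std::fs::File;
use std::io::{self, BufReader, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical helper (not the DataFusion API): opens `path`, seeks to
/// `start`, and returns a buffered reader limited to `length` bytes.
fn chunk_reader(
    path: &Path,
    start: u64,
    length: u64,
) -> io::Result<Box<dyn Read + Send + Sync>> {
    // A new file descriptor is opened for each chunk, as in the diff above.
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(start))?;

    // `take` caps the reader at `length` bytes; `BufReader` adds the buffer.
    // The result still implements `Read`, so callers need no `BufRead` bound.
    Ok(Box::new(BufReader::new(file.take(length))))
}
```

Callers only see `Read`, so the buffering stays an implementation detail of the local file reader.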

xudong963 · 2021-11-27T16:08:30Z

🎉

rdettai · 2021-11-28T17:38:24Z

Thanks @Dandandan ! Can you quickly explain what the reason for the slowdown was exactly?

Dandandan · 2021-11-28T18:37:35Z

> Thanks @Dandandan ! Can you quickly explain what the reason for the slowdown was exactly?

As far as I can explain:

The earlier code used the parquet-based API to read from a file. That API uses a `BufReader` internally, which is crucial for performance.

By introducing the object storage abstraction, we were reading directly from a `File` instance without any buffering in between, i.e. making lots of extra calls to the OS (as you also hinted at in #1363).
This slowed down loading the data, and it was also very expensive in the part that reads metadata/statistics (which normally takes something like <1ms locally). That part probably makes many small read calls.

By wrapping the File instance in the BufReader we avoid those calls to the OS.
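
A rough way to see the effect is the sketch below. It is an illustration, not a benchmark from this PR. `CountingReader` and `underlying_calls` are hypothetical names; the counter stands in for a `File`, where every `read` call would be a system call.

```rust
use std::io::{self, BufReader, Cursor, Read};

/// Wraps any reader and counts how often `read` is called on it.
/// Stand-in for a `File`, where each such call would hit the OS.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Consumes `data` in 4-byte pieces and returns how many `read` calls
/// reached the underlying reader, with or without a `BufReader` in between.
fn underlying_calls(data: &[u8], buffered: bool) -> io::Result<usize> {
    let mut counter = CountingReader {
        inner: Cursor::new(data.to_vec()),
        calls: 0,
    };
    let mut small = [0u8; 4];
    if buffered {
        // Small reads are served from the in-memory buffer; the underlying
        // reader is only touched when the buffer needs refilling.
        let mut reader = BufReader::new(&mut counter);
        while reader.read(&mut small)? > 0 {}
    } else {
        // Every small read goes straight to the underlying reader.
        while counter.read(&mut small)? > 0 {}
    }
    Ok(counter.calls)
}
```

For 64 KiB of input read four bytes at a time, the unbuffered path reaches the underlying reader once per small read, while the buffered path only does so when refilling its buffer, which is far less often.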

Maybe a potential improvement would be having a bit more control, such as setting the capacity of the buffer.
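
As a sketch of what that control could look like (an assumption about a possible follow-up, not something this PR implements), the standard library already offers `BufReader::with_capacity`. The helper name `buffered_with_capacity` is hypothetical:

```rust
use std::io::{BufReader, Read};

/// Hypothetical helper: wraps `inner` in a `BufReader` whose buffer size is
/// chosen by the caller instead of using the standard library default.
fn buffered_with_capacity<R: Read>(inner: R, capacity: usize) -> BufReader<R> {
    BufReader::with_capacity(capacity, inner)
}
```

A larger capacity means fewer refills for sequential scans at the cost of more memory per open reader. Which value works best would have to be measured.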

Dandandan added 2 commits November 26, 2021 17:05

Use BufRead to improve performance

f10bd74

Undo stat change

70322f0

github-actions bot added the datafusion Changes in the datafusion crate label Nov 26, 2021

Dandandan added 2 commits November 26, 2021 17:13

Format

fbb1410

Undo stat change

9cbaf67

Dandandan changed the title ~~Use BufRead for ChunkObjectReader to improve performance~~ Use BufRead for ChunkObjectReader to revert performance regression in parquet reading Nov 26, 2021

Unneeded imports

10c2a8c

Dandandan mentioned this pull request Nov 26, 2021

TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367

Closed

Use BufRead in test code

0926029

Dandandan requested review from alamb and houqp November 26, 2021 17:30

alamb approved these changes Nov 27, 2021

Revert trait change

0843b61

Dandandan changed the title ~~Use BufRead for ChunkObjectReader to revert performance regression in parquet reading~~ Use BufReader for ChunkObjectReader to revert performance regression in parquet reading Nov 27, 2021

Dandandan changed the title ~~Use BufReader for ChunkObjectReader to revert performance regression in parquet reading~~ Use BufReader for LocalFileReader to revert performance regression in parquet reading Nov 27, 2021

Dandandan merged commit 7ee85b2 into apache:master Nov 27, 2021
