Arrow-csv reader cannot produce RecordBatch even if the bytes are necessary #3674

Closed
metesynnada opened this issue Feb 8, 2023 · 6 comments · Fixed by #3677
Labels: arrow (Changes to the arrow crate), bug

Comments

@metesynnada
Contributor

metesynnada commented Feb 8, 2023

Describe the bug

The arrow-csv reader in arrow-rs mishandles CSV data read through a FIFO, where no EOF arrives until the writer closes the file. The read loop always waits for additional bytes from the underlying reader, even when it has already buffered enough rows to fill a batch of the configured size.

For example, if the batch size is set to 64 and exactly 64 rows are provided to the reader, the decoder has enough data to create a RecordBatch. However, on the second iteration of the loop the code blocks waiting for additional bytes at self.reader.fill_buf()?, causing a deadlock. This prevents streaming tests from working, even though they were supported before PR #3604.

impl<R: BufRead> BufReader<R> {
    fn read(&mut self) -> Result<Option<RecordBatch>, ArrowError> {
        loop {
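            // NOTE: even when the decoder already holds a full batch, this call blocks
            // until more input arrives, which on an open FIFO may never happen.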
            let buf = self.reader.fill_buf()?;
            let decoded = self.decoder.decode(buf)?;
            if decoded == 0 {
                break;
            }
            self.reader.consume(decoded);
        }

        self.decoder.flush()
    }
}

To Reproduce

  • Add nix = "0.26.2" to dev-dependencies.
  • Copy the code below into arrow-csv/src/reader/records.rs (or any other convenient place in arrow-csv) and run the test.
#[cfg(test)]
mod pr {
    use crate::ReaderBuilder;
    use arrow_array::RecordBatch;
    use arrow_schema::{ArrowError, DataType, Field, Schema, SchemaRef};
    use nix::sys::stat;
    use nix::unistd;
    use std::fs::{File, OpenOptions};
    use std::io::BufRead;
    use std::io::BufReader as StdBufReader;
    use std::io::Write;
    use std::path::Path;
    use std::path::PathBuf;
    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::{Duration, Instant};
    use tempfile::TempDir;

    fn create_fifo_file(
        tmp_dir: &TempDir,
        file_name: &str,
    ) -> Result<PathBuf, ArrowError> {
        let file_path = tmp_dir.path().join(file_name);
        if let Err(e) = unistd::mkfifo(&file_path, stat::Mode::S_IRWXU) {
            Err(ArrowError::CsvError(e.to_string()))
        } else {
            Ok(file_path)
        }
    }

    fn write_to_fifo(mut file: &File, line: &str) -> Result<usize, ArrowError> {
        file.write(line.as_bytes()).or_else(|e| {
            // Broken pipe error
            if e.raw_os_error().unwrap() == 32 {
                thread::sleep(Duration::from_millis(100));
                return Ok(0);
            }
            Err(ArrowError::CsvError(e.to_string()))
        })
    }

    fn read_from_csv<R: BufRead>(
        mut reader: R,
        schema: SchemaRef,
        batch_size: usize,
    ) -> Result<impl Iterator<Item = Result<RecordBatch, ArrowError>>, ArrowError> {
        let mut decoder = ReaderBuilder::new()
            .with_schema(schema)
            .with_batch_size(batch_size)
            .build_decoder();
        let mut next = move || {
            loop {
                // Deadlock happens here: we keep waiting for more bytes even though
                // the decoder may already hold enough rows to produce the first batch.
                let buf = reader.fill_buf()?;
                let decoded = decoder.decode(buf)?;
                if decoded == 0 {
                    break;
                }
                reader.consume(decoded);
            }
            decoder.flush()
        };
        Ok(std::iter::from_fn(move || next().transpose()))
    }

    const TEST_BATCH_SIZE: usize = 50;

    #[test]
    fn csv_reader_env() -> Result<(), ArrowError> {
        // We use a lock to wait for a batch creation
        let waiting = Arc::new(Mutex::new(true));
        let waiting_thread = waiting.clone();
        let tmp_dir = TempDir::new()?;
        let fifo_path = create_fifo_file(&tmp_dir, "fifo_file.csv")?;
        let fifo_path_thread = fifo_path.clone();
        let joinable_iterator = (0..TEST_BATCH_SIZE).map(|_| "a".to_string());
        let fifo_writer = thread::spawn(move || {
            let first_file = OpenOptions::new()
                .write(true)
                .open(fifo_path_thread)
                .unwrap();
            for (cnt, string_col) in joinable_iterator.enumerate() {
                let line = format!("{string_col},{cnt}\n").to_owned();
                write_to_fifo(&first_file, &line).unwrap();
            }
            // Keep the writer side open so the FIFO does not deliver an EOF yet.
            while *waiting_thread.lock().unwrap() {
                thread::sleep(Duration::from_millis(200));
            }
        });
        let schema = Arc::new(Schema::new(vec![
            Field::new("a1", DataType::Utf8, false),
            Field::new("a2", DataType::UInt32, false),
        ]));

        let file = File::open(fifo_path).unwrap();
        let reader = StdBufReader::new(file);

        let mut read = read_from_csv(reader, schema.clone(), TEST_BATCH_SIZE)?;

        while let Some(Ok(batch)) = read.next() {
            // Once we get a batch, clear the flag so the writer thread can finish.
            *waiting.lock().unwrap() = false;
            println!("Got a record batch with {} rows", batch.num_rows());
        }
        }
        fifo_writer.join().unwrap();
        Ok(())
    }
}

Expected behavior

  • For the reproduction code: produce the RecordBatch and finish.
  • For the algorithm: it should be able to produce a RecordBatch as soon as the necessary bytes have been received (see the sketch below).
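
A minimal sketch of that expectation, written against the same Decoder API and imports used in the reproducer above (the test name and the exact rows/batch size are illustrative, not from the issue): feeding the decoder exactly batch_size complete rows is enough for flush() to emit a RecordBatch, with no further reads required.

    #[test]
    fn decoder_emits_batch_without_extra_bytes() -> Result<(), ArrowError> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("a1", DataType::Utf8, false),
            Field::new("a2", DataType::UInt32, false),
        ]));
        let mut decoder = ReaderBuilder::new()
            .with_schema(schema)
            .with_batch_size(2)
            .build_decoder();

        // Exactly two complete rows, i.e. one full batch and nothing more.
        decoder.decode(b"a,1\nb,2\n")?;

        // flush() already has everything it needs; no extra fill_buf() should be required.
        let batch = decoder.flush()?.expect("a full batch should be buffered");
        assert_eq!(batch.num_rows(), 2);
        Ok(())
    }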

Additional context
NA

cc @alamb @tustvold

@metesynnada metesynnada added the bug label Feb 8, 2023
@tustvold
Contributor

tustvold commented Feb 8, 2023

the code waits for additional bytes at self.reader.fill_buf()?, causing a deadlock

I'm confused by this. What happens if there are fewer than batch_size rows available, or more? This feels like it just slightly changes the buffering behaviour, which isn't really guaranteed. I'm not saying we can't change this, but I'd like to understand the issue better. It almost feels like the deadlock is the fault of an overly restrictive test?

@metesynnada
Contributor Author

metesynnada commented Feb 8, 2023

I intentionally made the test like this to make the issue clearer.

I think waiting for the next byte before producing a RecordBatch, even when we already have the necessary bytes, is avoidable.

We use this test pattern for testing stream pipelines. We test "If I give a batch, can I get a batch as output?", but when testing through actual files it becomes "If I give batch_size + 1 rows, can I get a batch as output?".

@metesynnada metesynnada changed the title Deadlock in arrow-csv reader for FIFO file reading Arrow-csv reader cannot produce RecordBatch even if the bytes are necessary Feb 8, 2023
@ozankabak

Deadlock doesn't seem to be the right word. However, the current behavior can result in unnecessarily long latencies if data comes in chunks aligned with batch boundaries. This is an edge case, but when it happens, it becomes a problem in streaming use cases. Thankfully, it is easily avoidable.

@alamb
Contributor

alamb commented Feb 8, 2023

I wonder if the latency can be reduced by calling flush() on the underlying Decoder when the driver program knows (somehow) that it has received the end of a record and is not in the middle of decoding.

https://docs.rs/arrow-csv/32.0.0/arrow_csv/reader/struct.Decoder.html#method.flush

Perhaps something like

        let mut next = move || {
            loop {
                // Force a flush to produce a RecordBatch once we have fed the decoder all the
                // input that is currently available and know it ends on a record boundary
                // (check_have_read_to_boundary() is a hypothetical helper for that check).
                if check_have_read_to_boundary() {
                    if let Some(batch) = decoder.flush()? {
                        return Ok(Some(batch));
                    }
                }
                let buf = reader.fill_buf()?;
                let decoded = decoder.decode(buf)?;
                if decoded == 0 {
                    break;
                }
                reader.consume(decoded);
            }
            decoder.flush()
        };
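
The change that eventually merged (#3677, see below) takes a related approach: rather than relying on an external boundary check, the decoder itself reports how much room is left in the current batch (the Decoder::capacity method named in the merge commit below). A rough sketch of the buffered read loop this enables, assuming capacity() returns the number of additional rows the decoder can buffer, i.e. 0 once a full batch is pending:

        let mut next = move || {
            loop {
                let buf = reader.fill_buf()?;
                let decoded = decoder.decode(buf)?;
                reader.consume(decoded);
                // Stop as soon as the input is exhausted or the decoder holds a full
                // batch, so flush() can run without waiting on fill_buf() again.
                if decoded == 0 || decoder.capacity() == 0 {
                    break;
                }
            }
            decoder.flush()
        };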

tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 8, 2023
@tustvold
Contributor

tustvold commented Feb 8, 2023

Would you be able to test out #3677 and see if it meets your requirements? If so, I can polish it up with some tests, etc.

tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 8, 2023
@metesynnada
Contributor Author

@tustvold thank you for your effort. #3677 meets our requirements.

tustvold added a commit that referenced this issue Feb 10, 2023
* Add CSV Decoder::capacity (#3674)

* Add test

* Remove unnecessary extern

* Add docs
@tustvold tustvold added the arrow (Changes to the arrow crate) label Feb 27, 2023