Provide an async ParquetWriter for arrow #1269
Comments
I'll try it @alamb |
Thanks @xudong963 - I personally suggest starting with the API design. As in, write an example program showing how you would use the feature (sorting out what types to use for streams, etc.). Something like the following:

```rust
async fn main() {
    // get output stream
    let writer = AsyncParquet::new(...);
    // write batches to the writer somehow (not sure how??)
    for batch in batches {
        ....
    }
}
``` |
Thanks for your nice suggestion! @alamb, I'll start the task tomorrow. |
I felt the need to provide these new APIs and structs.

```rust
pub struct AsyncArrowWriter<W: AsyncWrite + Unpin + Send> {
    writer: FileStream<W>,
    ...
}

pub struct FileStream<W: AsyncWrite + Unpin + Send> {
    writer: W,
}

impl<...> FileStream {
    pub async fn write(&mut self, ...) -> Result<()> {
        // 1. first, write the header
        write_header().await?;
        // 2. then, write the row groups
        write_row_groups().await?;
        // 3. finally, write the metadata
        write_metadata().await?;
    }
}
```

I noticed that we have a buffer optimization in the sync write path. I'm not sure if we need to keep the buffer in the async write path. |
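One way the sync write path's buffer could be kept in an async writer is to have the sync encoder write into a shared in-memory buffer that the async side drains between batches. The following is a std-only illustrative sketch of that idea, not the crate's actual implementation; `SharedBuffer` and `drain` are hypothetical names:

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

/// Hypothetical shared buffer: a sync writer writes into it via
/// `std::io::Write`, and an async task would periodically drain it
/// to the real output (e.g. via `AsyncWriteExt::write_all`).
#[derive(Clone, Default)]
struct SharedBuffer(Arc<Mutex<Vec<u8>>>);

impl Write for SharedBuffer {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

impl SharedBuffer {
    /// Take everything buffered so far, leaving the buffer empty.
    fn drain(&self) -> Vec<u8> {
        std::mem::take(&mut *self.0.lock().unwrap())
    }
}

fn main() {
    let buffer = SharedBuffer::default();
    // The sync writer side: plain std::io::Write calls.
    let mut sink = buffer.clone();
    sink.write_all(b"PAR1").unwrap();
    sink.write_all(b"row group bytes").unwrap();
    // The async side drains the buffer between writes.
    let drained = buffer.drain();
    assert_eq!(drained, b"PAR1row group bytes");
    assert!(buffer.drain().is_empty());
    println!("drained {} bytes", drained.len());
}
```

The key property is that the sync encoding code never needs to know about async IO; only the drain step touches the underlying stream. |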
Are you referring to the buffer added in #1214? It would be really cool to keep that if possible. |
@tustvold Hi, this feature will help us a lot if it is implemented. Is there anyone working on this issue? I'm glad to try it if not. |
@ShiKaiWi Thanks! Feel free to try it. I'm busy with other things. |
I wonder if this could simply be implemented by adding an …

This would then allow providing a …

The nature of parquet is that an entire row group is buffered up and written in one shot, as data for different columns cannot be interleaved, so I'm not sure it is possible to do better than this |
@tustvold Thanks for your quick response. Here are some of my thoughts on your proposal:
- Periodically gaining access is perhaps not an elegant approach (though it can indeed solve the problem we encounter), because the polling frequency can't be determined easily: polling at a high frequency may introduce overhead, while polling at a low frequency may lead to high memory usage in the inner writer.
- For now, I don't know the details of the writer very well, but I guess the buffer logic can still be kept in the async writer. |
Checking after each call to write should be fine. A more sophisticated writer could track the number of written rows and only check once they exceed the max row group size; in practice this is highly unlikely to make a tangible performance difference.
Assuming a single RecordBatch does not exceed the maximum size of a row group, the above approach should be optimal in this respect. In practice the parquet data is so much smaller than the corresponding arrow data, that the size of this buffer is likely irrelevant in the grand scheme of things.
If we can share as much between the sync and async implementations as possible, that would be beneficial. The separate async read path, whilst a necessary evil, is a non-trivial maintenance burden |
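The "check after each call to write" policy described above can be illustrated with a minimal std-only sketch; the byte threshold here stands in for the row-group size check, and `ThresholdWriter` is a hypothetical name, not a parquet type:

```rust
/// Hypothetical writer that buffers encoded bytes and records a flush
/// whenever the buffered size crosses a threshold after a write --
/// mirroring "check after each call to write".
struct ThresholdWriter {
    buffer: Vec<u8>,
    threshold: usize,
    flushes: usize,
}

impl ThresholdWriter {
    fn new(threshold: usize) -> Self {
        Self { buffer: Vec::new(), threshold, flushes: 0 }
    }

    /// Append one "batch" of encoded bytes, then check the size once.
    fn write(&mut self, batch: &[u8]) {
        self.buffer.extend_from_slice(batch);
        if self.buffer.len() >= self.threshold {
            // In a real async writer, this is where the buffered row
            // group would be awaited out to the underlying stream.
            self.buffer.clear();
            self.flushes += 1;
        }
    }
}

fn main() {
    let mut w = ThresholdWriter::new(10);
    for _ in 0..7 {
        w.write(&[0u8; 4]); // 4 bytes per simulated batch
    }
    // 7 * 4 = 28 bytes total: flushes at 12 and 12, with 4 left over.
    assert_eq!(w.flushes, 2);
    assert_eq!(w.buffer.len(), 4);
    println!("flushes = {}", w.flushes);
}
```

Because the size check happens only once per `write` call, its cost is negligible next to the encoding work each call performs. |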
FYI I've created #3871 to track reducing the buffering of arrow data, this won't change the need to buffer an entire row group's data before flushing, but would allow buffering parquet encoded data instead of arrow data, which may reduce memory pressure. |
👍. This approach is good enough to solve the problem we encounter.
After digging into the code, I find this is the hardest part of implementing the async writer. I guess your suggested approach is the best workaround until most of the code can be shared between the sync and async implementations. |
@tustvold Maybe I can solve this issue by implementing your proposal.
Please let me know when you make a decision. |
How about we implement an AsyncWriter using the buffering approach, and see what it looks like |
Ok. I'll give it a try. |
@tustvold I find it hard to implement the |
Looks good to me, I would consider putting it in a separate async_writer module behind the async feature flag, but otherwise it looks 👌 |
Of course, I will make it in the formal PR. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As it is nice to be able to read parquet files using rust async IO (#111), it would be nice to write them as well.
Describe the solution you'd like
An async writer similar in spirit to the reader created in #111.