Experiment with rust-s3 and caching #53
base: main
Conversation
thank you very much for the contribution @mateuszkj 👍 CC @seddonm1 (as the original core author) and @houqp (who has helped with reviewing changes / design here and has experience with delta-rs, which i believe connects to s3 with rusoto).

To confirm - what is the primary reason for switching to `rust-s3`? Indeed, regarding the cache - can you please include more implementation details and a description of how specifically it is used? Could we perhaps start with just using the cache functionality with aws-sdk-rust?

regarding the IMDS timeout - do you have an example in other languages of how to disable it that we could use as reference?

and regarding testing other s3 servers - if it isn't too much work, would you be able to run CI against them using docker files (it looks like seaweed makes that easy)? I think this would be a great value add to the project regardless and would make comparing results easier.

Given the size of the change i will need some time to review the code - so please bear with me, as i will likely need at least a few days to review and test. thank you again for the contribution.
thanks for your fast response @matthewmturner

After second thought I think this MR should not be merged. It can be done better. Can you mark this MR as a draft?

### Moving cache mechanism to a different crate

Now I consider that the caching mechanism could be moved to its own crate, for example:

```rust
let s3_file_system = S3FileSystem::new(...).await?;
let s3_cached_file_system = Arc::new(ObjectStoreCache::new(s3_file_system, ObjectStoreCacheOptions::from_envs()?));
runtime_env.register_object_store("s3", s3_cached_file_system);
```

With the cache in a different crate, we could have two S3 client implementations, or other remote storage systems.

### Cache description

First, let's look at how it behaves. To disable the cache (and enable debug logging), set environment variables:

```shell
export RUST_LOG=info,datafusion_objectstore_s3=debug
export DATAFUSION_S3_MIN_REQUEST_SIZE=0
export DATAFUSION_S3_USE_METADATA_PREFETCH=false
```

Below is an example of querying data using SQL similar to:

```sql
SELECT
    s.col3_text,
    s.col4_int
FROM stops s
WHERE s.col1_time >= CAST('2022-03-02 00:00:00' AS Timestamp)
  AND s.col1_time < CAST('2022-03-03 00:00:00' AS Timestamp)
  AND s.col2_text = 'xxx'
ORDER BY s.col1_time;
```

The query uses 4 columns and filters by time and a text field. I have multiple *.parquet files in my table (89 files). I grepped the logs for only one of them for clarity.
What we see in the logs:

My understanding of how to implement the cache was:

### CI with other self-hosted S3 servers

Running CI with other self-hosted S3 servers is a great idea. I could do it when I move the cache to another crate, but for now I want to focus on something else.

### aws-sdk-rust and rust-s3

About why I dropped `aws-sdk-rust` - one of its drawbacks for me was the IMDS timeout issue (see the MR description). Maybe we need two S3 implementations, one for AWS services with the official `aws-sdk-rust`, and another for self-hosted S3 servers.
Thanks for the detailed write-up @mateuszkj - it makes sense. I think that bringing up the cache on the main datafusion repo would be a good starting point to see what the community thinks about this. For example, there is an active HDFS community using datafusion as well (https://github.com/datafusion-contrib/datafusion-objectstore-hdfs) and perhaps this could benefit them. In general there is a lot of focus on improving performance - and parallelism - within datafusion (for example apache/datafusion#2199), so i think it's a good time to bring it up. I can help with creating that if you would like.

Regarding CI and testing with other S3 providers - I can work on this. I think it's a valuable addition to the project regardless and should help with whatever changes come from this.

I will look into reproducing the IMDS issue. I think once we have a better understanding of it then we can consider whether it makes sense to use alternative s3 implementations.

Thanks for working with me on this!
Hi, thanks for working on this. I'm currently actively working on making it so that IOx can make use of more of the DataFusion functionality for object storage, where it currently does its own thing. I am therefore very interested in helping drive improvements in this space 👍
This might be a behaviour we want to fix upstream in DataFusion/arrow-rs, as opposed to working around with caching. For reference apache/arrow-rs#1473 contains some investigative work refining the interface exposed by the parquet crate, and in particular apache/arrow-rs#1509 might be of interest. It is all still proposals and experiments at this stage, but I would be more than happy to help draft up some issues if you (or anyone else) wanted to lend a hand getting this over the line.
Not needing to scan or list object storage in order to plan the query seems very sensible. One thought might be to integrate with Object Storage at the TableProvider level, as opposed to the ObjectStore level to allow using a dedicated catalog. In IOx we use a custom catalog, but I could see adding support for a Hive compatible MetaStore as being a valuable feature for the DataFusion ecosystem. Using caching to accelerate queries on datasets that lack a dedicated catalog is definitely still useful, just mentioning that typically one would have some sort of catalog. I've personally had a lot of success using lambdas to populate an AWS Glue catalog, but there are loads of alternatives in this space. apache/datafusion#2079 might also be relevant and concerns how the TableProvider translates the catalog information into a physical query plan.
I'm not familiar with aws-sdk-rust, as I've only ever used rusoto, but IMDS is the mechanism by which instances running in AWS obtain IAM credentials with which to authorize with object storage. It does this by talking to an HTTP service running at a special IP. This won't exist when running on your local machine and so you might need to provide dummy AWS access credentials as environment variables, e.g.
@tustvold thank you for the insight - very helpful. personally, i am interested in learning about and implementing these types of performance improvements. so if you could draft some issues outlining what you are looking to accomplish i would be happy to try and help.
Ok I'll see what I can do, I have quite a few plates in the air at the moment so might be a day or two before I can produce anything coherent, but I'll be sure to tag you on any issues once I file them 😀
@tustvold No rush - I have a couple things I'll need to close before being able to start on that anyway. Thanks again.
Hi everyone. I wrote the base implementation so any blame can be directed at me 😉. I think moving away from the official AWS SDK would be a mistake, as they are developing a strong Rust practice, so I would expect it to be better maintained than the others going forward (maintaining a library for someone else's API is a bit of a thankless task). This was always a bit of a hacky implementation until async datasources are properly implemented. It is good the IOx team is onboard, as I hope we can arrive at one canonical implementation that deals with all of the issues above. That means any caching ideas would be good to incorporate. 👍
@seddonm1 No blame being directed - we wouldn't be having this conversation without your efforts.
As @tustvold mentioned, |
Another place that I think caching could help is at the object store level, to perform look-ahead scan, i.e. fetch the next chunk of data from S3 while the system is performing compute on the previous chunk.
This merge request provides:

- replacing `aws-sdk-rust` in favor of the much simpler `rust-s3`
- caching

### Caching

The motivation for caching is that the current datafusion implementation is not optimal for network file operations:

Results of very unscientific tests on a local machine with my example SQL query:

### Why not aws-sdk-rust?

I don't know how AWS works; unit tests on my local machine took several minutes to complete (with success). I think it was because of IMDS timeouts. I couldn't find how to disable it, so I chose a much simpler implementation of an S3 client.

I also had problems with `rust-s3` and another self-hosted S3 server. I created MRs to fix them: durch/rust-s3#267 and datenlord/s3-server#114

### What can be done next?