-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved configuration / resource reuse for S3 #915
Conversation
This reverts commit 522b3d7.
Learn the region that buckets are in, where possible Configure S3 the client to default to the region that the resource is in, to save it potentially finding the region itself on a per-request basis Dont' reuse clients across multiple regions (for the config reason) Don't reuse clients across multiple threads (S3 resources and sessions are not guaranteed to be thread safe) Do reuse clients / resources / buckets in other cases - these will be cached until the command exits This is the also the groundwork for fetching tiles during WC checkout, which will probably need to happen on multiple threads.
try: | ||
response = get_s3_client().head_bucket(Bucket=bucket) | ||
return response["ResponseMetadata"]["HTTPHeaders"]["x-amz-bucket-region"] | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main situation this might fail in seems to be if we have GetObject
but not ListBucket
permissions - in which case we can fetch the tiles but not get the bucket details. Is that right?
Do we require the ListBucket
perm elsewhere for read-only access anyway, or is it only required for importing tiles? If we already need it elsewhere we might as well drop the except clause here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would require it if the user tried to use wildcards but we wouldn't require it if they listed in full the object(s) that make up the dataset. I'm inclined to leave this in since it just makes Kart a bit more robust
kart/s3_util.py
Outdated
@@ -12,37 +14,82 @@ | |||
|
|||
# Utility functions for dealing with S3 - not yet launched. | |||
|
|||
# Lots of functions in this file look like this: | |||
# >> @functools.lru_cache() | |||
# >> def _get_some_thread_unsafe_s3_resource(region, thread_hash): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe write a better decorator that wraps this up so you don't have to do the weird thread stuff in two versions of every function, eg @thread_local_cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Don't store S3 regions - we don't need them, and they could (occasionally) change over time,
if somebody carefully deletes and recreates a bucket. And, it's extra junk to store.
Learn the region that buckets are in, where possible.
Configure S3 the client to default to the region that the resource is in,
to save it potentially finding the region itself on a per-request basis.
Dont' reuse clients across multiple regions (for the config reason).
Don't reuse clients across multiple threads (S3 resources and sessions are not guaranteed to be thread safe).
Do reuse clients / resources / buckets otherwise - these will be cached until the command exits.
This is the also the groundwork for fetching tiles during WC checkout,
which will probably need to happen on multiple threads.
Related links:
#905
Checklist: