
Cloud computing guide #21

Closed
3 tasks done
wlandau opened this issue Aug 20, 2020 · 8 comments
@wlandau
Member

wlandau commented Aug 20, 2020

Prework

  • I understand and agree to this repository's code of conduct.
  • I understand and agree to this repository's contributing guidelines.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.

Proposal

There is a community misconception that targets (and drake) do not have HPC capabilities beyond parallel computing over the cores of a single local machine. On the contrary, both tools support distributed computing on clusters (guides here and here), and the workers do not necessarily need access to the file system of the master process. (In fact, I designed targets with an efficient dynamic branching model to go beyond the inherent limitations of map-reduce-like scheduling algorithms and conserve computing resources.)

However, I do realize that data scientists at smaller institutions do not always have access to clusters, and an increasing number of folks use AWS. AWS ParallelCluster could be a way to deploy pipelines to the cloud without any need to modify targets itself. If it works, we should probably write a tutorial, either in the existing HPC chapter or in a chapter of its own.

ropensci/tarchetypes#8 could be an alternative way to deploy to AWS. The advantage of ropensci/tarchetypes#8 is that we should also get the data versioning capabilities of Metaflow for free, and Metaflow may take care of a lot of the AWS setup. However, each new tar_metaflow() will require its own local R worker in order to avoid blocking the master process, which is not ideal.

@wlandau
Member Author

wlandau commented Aug 31, 2020

The biggest potential difference I see here (relative to Metaflow's approach to AWS) is that the targets data store will probably live locally. But that's not so bad: drake users want this behavior anyway so they can explore data interactively (example: ropensci/drake#1295). And as @noamross pointed out, aws.s3::s3sync() can upload the data store to an S3 bucket. _targets/ is super light relative to .drake/, so this shouldn't be too painful for most projects.
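For concreteness, here is a minimal sketch of the s3sync() idea: run the pipeline locally, then mirror the data store up to a bucket. The bucket name is a hypothetical placeholder, and this assumes AWS credentials are already configured in the environment.

```r
# Sketch: push the local targets data store to S3 after a pipeline run.
# "my-pipeline-bucket" is hypothetical; credentials come from the usual
# AWS environment variables or config files.
library(aws.s3)
library(targets)

tar_make()                        # run the pipeline; results land in _targets/

s3sync(
  path = "_targets",              # the local data store
  bucket = "my-pipeline-bucket",
  direction = "upload"            # mirror local files up to the bucket
)
```

A matching `direction = "download"` call on another machine would pull the store back down for interactive exploration.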

@wlandau
Member Author

wlandau commented Sep 4, 2020

If cloudyr packages still work, tar_make_future() can probably already talk to multiple AWS instances: https://gist.github.com/DavisVaughan/5aac4a2757c0947a499d25d28a8ca89b. But the data will still live locally.
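Roughly, the approach in that gist amounts to pointing a future plan at remote EC2 workers over SSH and letting tar_make_future() farm targets out to them. A hedged sketch, where the instance IPs, user, and key path are all hypothetical placeholders:

```r
# Sketch: run tar_make_future() against remote EC2 instances via a
# PSOCK cluster. IPs, user, and key path are hypothetical; the
# instances need R and the pipeline's packages installed.
library(future)
library(targets)

plan(cluster, workers = makeClusterPSOCK(
  c("34.201.0.10", "34.201.0.11"),        # public IPs of EC2 workers
  user = "ubuntu",
  rshopts = c("-i", "~/.ssh/my-key.pem")  # SSH identity file
))

tar_make_future(workers = 2)  # one future per outdated target, up to 2 at a time
```

As noted above, the data store still lives on the machine that launches the pipeline; only the computation moves to the cloud.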

@wlandau
Member Author

wlandau commented Sep 4, 2020

@MilesMcBain, your team uses AWS, right? What's your preferred way to interact with it?

@wlandau
Member Author

wlandau commented Sep 6, 2020

I read up more on AWS ParallelCluster, AWS Batch, and Metaflow's HPC, and targets' capabilities are not ready for a cloud computing guide yet. But I think development on top of paws can get us much of the way there.

@wlandau wlandau closed this as completed Sep 6, 2020
@MilesMcBain

Hey @wlandau, so far we have preferred to call the AWS CLI directly. This is mainly due to a combination of very simple workflows and uncertainty about the stability of the AWS-R ecosystem.

@wlandau
Member Author

wlandau commented Sep 8, 2020

That's helpful, I know cloudyr has had a rough time. What do you think about paws? I'm hoping it can help with ropensci/targets#152 either directly or through futureverse/future#415 or mschubert/clustermq#102 (comment).

@wlandau
Member Author

wlandau commented Sep 28, 2020

Reopening. I plan to write about ropensci/targets#176 at least.

I will keep my eye on R + cloud packages. It looks like aws.s3 was updated in May, which is a good sign. And paws is under constant development but isn't quite there yet for the new S3 feature set in targets.

@wlandau-lilly
Collaborator

Just wrote about S3 integration in the new cloud chapter. Will reopen after ropensci/targets#152.
