Shuffle script for training runs out of memory #360
Hi @anands-repo, when you say it runs out of memory, are you using the DataflowRunner on Google Cloud?
Hi @pichuan, I am not running on Google Cloud but on a local machine, so I went with the default runner. When I use the DataflowRunner, the shuffle script requests arguments relevant to GCS: for example, I get errors even though the path they refer to is actually a valid path on my machine. Kindly advise. Thanks!
The command that works, but runs out of memory, is this:
The command that doesn't run and gives an error:
Obtained error:
I also tried the SparkRunner, which runs but hits the same memory issue. It seems the DirectRunner and SparkRunner try to shuffle everything in memory (RAM) and do not use local storage. Maybe the DataflowRunner uses local storage (it accepts a --temp_location argument)? However, that is not available to me on my local machine, since the DataflowRunner seems to require the code to be run on Google Cloud.
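For context, a Dataflow run of this script is expected to read and write GCS paths rather than local ones, which is why the GCS-related arguments come up. A rough sketch of such an invocation is below; the project and bucket names are placeholders, and the flag names follow the training case study, so verify them against your DeepVariant and Beam versions:

```bash
# Sketch only: my-gcp-project and gs://my-bucket are placeholders, and the flag
# names follow the DeepVariant training case study; check your release's docs.
python3 shuffle_tfrecords_beam.py \
  --input_pattern_list="gs://my-bucket/examples/training_set.with_label.tfrecord-?????-of-?????.gz" \
  --output_pattern_prefix="gs://my-bucket/examples/training_set.with_label.shuffled" \
  --output_dataset_name="HG001" \
  --output_dataset_config_pbtxt="gs://my-bucket/examples/training_set.dataset_config.pbtxt" \
  --job_name=shuffle-tfrecords \
  --runner=DataflowRunner \
  --project=my-gcp-project \
  --region=us-east1 \
  --temp_location=gs://my-bucket/tempdir
```

Here --project, --region, and --temp_location are standard Beam/Dataflow pipeline options, which fits the observation above that they only make sense when the job actually runs on Google Cloud.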
Hi @anands-repo, the point of using Dataflow is to run things in a distributed fashion, which means it shouldn't be running on your local machine. If you can't use the shuffle script, you can consider a less fine-grained shuffle "hack" described in this older document: http://bit.ly/train-deepvariant (Note that this doc is a one-off document and is not maintained by our team. Please consider it a possible example that you'll probably need to tweak for your own use case.)
Hi @pichuan, thanks for the advice. I will look into these possibilities. The coarse-grained shuffling would be easiest; however, the document you mentioned, as well as the training page, notes that shuffling is an important step, and technically stochastic/batch gradient descent does depend on random batches. I will look into Spark, as well as other options like Dask or Torque (which would need a script hack). If I have a setup that works for local clusters, I will share it. Thanks!
As you know, I am running the shuffle script using Spark. I am wondering how many output files are expected from running the script. When I use the DirectRunner, I get a single output file; when I use the SparkRunner, I get as many output files as there are input files matching the pattern. (I have noticed this mismatch between the Spark and Direct runners in another situation as well: https://stackoverflow.com/questions/64450391/apache-beam-beam-flatten-doesnt-flatten-files-with-sparkrunner-but-does-so-wi). Is this the expected result when using the Dataflow runner as well? Basically, I am trying to do a sanity check to make sure that the shuffler isn't simply reading in the data and copying it without shuffling, or shuffling only within each shard. Thanks!
Just for the record, I wrote a script which shuffles the records locally using as little memory as possible: TFrecordShuffler. It uses about as much RAM as the total size of the input (record) files on disk. The downside is obviously the time it takes, which I imagine is much longer than with a distributed Google Cloud or Spark system. As an example, shuffling ~30 million records totaling 125 GB of files took 46 h (wall-clock and CPU time) and 150 GB of RAM.
Awesome, thank you @GuillaumeHolley. I'll make a note to add this as an option in our training tutorial in the future.
@GuillaumeHolley FYI, I'm working on updating the tutorial. I will add this sentence: "NOTE: If you prefer shuffling locally, please take a look at this user-provided TFrecordShuffler." If you want to suggest a different sentence for the tutorial, please let me know!
@pichuan, when shuffling the datasets using the local runner, is there a reason we use the DirectRunner rather than the DataflowRunner? Thank you.
Hi @yinshiyi, hello :D First, this is a pretty old bug, so it might be easier to open a new issue; otherwise our team member on rotation might not notice it. To your question, are you asking about https://github.com/google/deepvariant/blob/r1.6.1/docs/deepvariant-training-case-study.md ? In that documentation, we provided two examples of using the shuffle script: one is with `--runner=DirectRunner`, and the other one is with `--runner=DataflowRunner`. If you intend to use Dataflow, please refer to the command that uses `DataflowRunner`. @yinshiyi, if you want to discuss further, please open a new issue with a few more details on which command you were using. That will help our team member provide more support. Thank you :)
Describe the issue:
Shuffle script for tfrecords (https://github.com/google/deepvariant/blob/r1.0/docs/deepvariant-training-case-study.md) runs out of memory when using a training set from multiple BAM files.
This is what I followed:
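For reference, the DirectRunner shuffle invocation in that case study looks roughly like the sketch below; the output directory and dataset name are placeholders, and the exact flags may differ between DeepVariant releases:

```bash
# Sketch of the case study's local (DirectRunner) shuffle step.
# OUTPUT_DIR and the dataset name are placeholders; check the flags against
# the shuffle_tfrecords_beam.py shipped with your DeepVariant release.
OUTPUT_DIR=/path/to/examples
python3 shuffle_tfrecords_beam.py \
  --input_pattern_list="${OUTPUT_DIR}/training_set.with_label.tfrecord-?????-of-?????.gz" \
  --output_pattern_prefix="${OUTPUT_DIR}/training_set.with_label.shuffled" \
  --output_dataset_name="HG001" \
  --output_dataset_config_pbtxt="${OUTPUT_DIR}/training_set.dataset_config.pbtxt" \
  --runner=DirectRunner
```

With the DirectRunner, the entire shuffle is held in the memory of the single machine running the command.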
This requires over 230 GB of CPU RAM, and the process is eventually killed. I do not know whether the memory requirement will keep growing beyond this point. Is there another way to deal with this situation? For example, it would be possible to run shuffling for data from each bam file independently. However, I am not sure what the flow would look like after that point.
Setup