
Shuffle script for training runs out of memory #360

Closed
anands-repo opened this issue Oct 11, 2020 · 11 comments

Comments
@anands-repo

Describe the issue:
Shuffle script for tfrecords (https://github.com/google/deepvariant/blob/r1.0/docs/deepvariant-training-case-study.md) runs out of memory when using a training set from multiple BAM files.

I followed the steps in the training case study linked above.

This requires over 230 GB of CPU RAM, and the process is eventually killed; I do not know whether the memory requirement keeps growing beyond that point. Is there another way to deal with this situation? For example, shuffling could be run independently on the data from each BAM file, but I am not sure what the workflow would look like after that point.

Setup

  • Operating system: Ubuntu Bionic
  • DeepVariant version: 1.0.0
  • Installation method (Docker, built from source, etc.): Docker
@pichuan
Collaborator

pichuan commented Oct 13, 2020

Hi @anands-repo, when you say it runs out of memory, are you using the DataflowRunner as in our documentation?

@anands-repo
Author

anands-repo commented Oct 13, 2020

Hi @pichuan

I am not running on Google Cloud but on a local machine, so I went with the default runner.

When I use DataflowRunner, the shuffle script requires arguments relevant to GCS. For example, I get errors such as:

Invalid GCS path (<PATH>), given for the option: temp_location

The <PATH> here is actually a valid path on my machine.

Kindly advise. Thanks!

@anands-repo
Author

anands-repo commented Oct 13, 2020

The command that works but runs out of memory is this:

python $SCRIPTPATH/shuffle_tfrecords_beam.py \
  --input_pattern_list=$INPUT_PATTERN \
  --output_pattern_prefix=$OUTPUT_PREFIX \
  --output_dataset_config_pbtxt=$OUTPUT_DATASET_CONFIG_PBTXT \
  --output_dataset_name=$OUTPUT_DATASET_NAME

The command that doesn't run and gives an error is this:

python $SCRIPTPATH/shuffle_tfrecords_beam.py \
  --input_pattern_list=$INPUT_PATTERN \
  --output_pattern_prefix=$OUTPUT_PREFIX \
  --output_dataset_config_pbtxt=$OUTPUT_DATASET_CONFIG_PBTXT \
  --output_dataset_name=$OUTPUT_DATASET_NAME \
  --job_name=$JOBNAME \
  --project=$PROJECT_NAME \
  --temp_location=$TEMPLOCATION \
  --save_main_session \
  --region us-east1

Obtained error: Invalid GCS path (<PATH>), given for the option: temp_location

I also tried the SparkRunner, which works but runs into the same memory issue. It seems DirectRunner and SparkRunner try to shuffle everything in memory (RAM) and do not use local storage. Maybe DataflowRunner uses local storage (it accepts a --temp_location argument)? However, that is not available to me on my local machine, since DataflowRunner seems to require the code to be run on Google Cloud.

@pichuan
Collaborator

pichuan commented Oct 13, 2020

Hi @anands-repo

The point of using Dataflow is to run things in a distributed fashion, which means it shouldn't be running on your local machine.
I assume the Spark runner should allow you to run with distributed workers as well if you set it up correctly, but I have never used it myself.
Our team doesn't support different distributed setups for Beam. Please refer to https://beam.apache.org/documentation/runners/spark/ to see if you can set up the Spark runner so that it doesn't use your local memory. If you do figure out a good setup for running on Spark, please feel free to share some tips here so people can use them in the future!

If you can't use the shuffle script, you can consider a less fine-grained shuffle "hack" described in this older document: http://bit.ly/train-deepvariant (Note that this is a one-off document and is not maintained by our team. Please consider it a possible example that you'll probably need to tweak for your own use case.)
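
To make the coarse-grained idea concrete, here is a rough sketch of one way to do it (this is only an illustration, not the approach from that document and not something our team ships; the file patterns and GZIP compression are placeholders you would adapt to your own data): shuffle the records within each shard independently and write the shards back out in a random order.

import glob
import random

import tensorflow as tf

# Rough sketch of a coarse-grained shuffle: shuffle records within each shard
# and randomize the order of the shards themselves. Names and GZIP compression
# are assumptions; adjust them to match your tfrecord files.
def coarse_shuffle(input_glob, output_prefix, seed=0):
    rng = random.Random(seed)
    shards = sorted(glob.glob(input_glob))
    rng.shuffle(shards)  # randomize the order of the shards
    for i, shard in enumerate(shards):
        records = [r.numpy() for r in
                   tf.data.TFRecordDataset(shard, compression_type="GZIP")]
        rng.shuffle(records)  # shuffle within the shard
        out = "%s-%05d-of-%05d.tfrecord.gz" % (output_prefix, i, len(shards))
        with tf.io.TFRecordWriter(out, options="GZIP") as writer:
            for record in records:
                writer.write(record)

coarse_shuffle("training_examples*.tfrecord.gz", "coarse_shuffled_examples")

This only ever holds one shard in memory, which is the trade-off against a true global shuffle.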

@pichuan pichuan closed this as completed Oct 13, 2020
@anands-repo
Author

Hi @pichuan, thanks for the advice. I will look into these possibilities.

The coarse-grained shuffling would be the easiest; however, both the document you mentioned and the training page note that shuffling is an important step. Technically, stochastic/batch gradient descent does depend on random batches.

I will look into Spark, as well as other options like Dask or Torque (which would need a script hack). If I get a setup that works for local clusters, I will share it.

Thanks!

@anands-repo
Author

@pichuan

As you know, I am running the shuffle script using Spark. I am wondering how many output files are expected when running the script.

When I use DirectRunner, I get a single output file. When I use SparkRunner, I get as many output files as there are input files matching the pattern (I have noticed this mismatch between the Spark and Direct runners in another situation as well: https://stackoverflow.com/questions/64450391/apache-beam-beam-flatten-doesnt-flatten-files-with-sparkrunner-but-does-so-wi).

Is this the expected result when using the Dataflow runner as well? Basically, I am trying to do a sanity check to make sure that the shuffler isn't simply reading in the data and copying it without shuffling, or only shuffling within each shard.
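
As a quick check of at least the record counts, I am using a throwaway sketch like the one below (the glob patterns are just placeholders for my input and output shards):

import glob

import tensorflow as tf

def count_records(pattern, compression_type="GZIP"):
    # Count serialized records across every file matching the pattern.
    files = sorted(glob.glob(pattern))
    total = 0
    for path in files:
        total += sum(1 for _ in tf.data.TFRecordDataset(path, compression_type=compression_type))
    return len(files), total

in_files, in_records = count_records("training_examples*.tfrecord.gz")
out_files, out_records = count_records("shuffled_training_examples*.tfrecord.gz")
print("input:  %d files, %d records" % (in_files, in_records))
print("output: %d files, %d records" % (out_files, out_records))
assert in_records == out_records, "shuffler dropped or duplicated records"

Matching counts of course don't prove the records were mixed across shards, which is why I am asking about the expected number of output files.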

Thanks!

@GuillaumeHolley

Just for the record, I wrote a script which shuffles the records locally using as little memory as possible: TFrecordShuffler. It uses about as much RAM as the total size of the input (record) files on disk. The downside is obviously the time it takes, which I imagine is much longer than with a distributed Google Cloud or Spark setup. As an example, shuffling ~30 million records totaling 125 GB of files took 46 h (wall-clock and CPU) and 150 GB of RAM.
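
The core of the approach is roughly the following simplified sketch (not the actual TFrecordShuffler code; the file pattern and GZIP compression are placeholders): hold only the serialized record bytes in RAM, shuffle them, and write them back out.

import glob
import random

import tensorflow as tf

# Simplified sketch of a local in-memory shuffle: RAM usage is roughly the
# total size of the serialized records. The pattern and GZIP compression are
# placeholders to adapt to your own files.
def shuffle_locally(input_glob, output_path, seed=0):
    records = []
    for path in sorted(glob.glob(input_glob)):
        for rec in tf.data.TFRecordDataset(path, compression_type="GZIP"):
            records.append(rec.numpy())
    random.Random(seed).shuffle(records)
    with tf.io.TFRecordWriter(output_path, options="GZIP") as writer:
        for rec in records:
            writer.write(rec)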

@pichuan
Collaborator

pichuan commented Jan 26, 2022

Awesome, thank you @GuillaumeHolley. I'll make a note to add this as an option in our training tutorial in the future.

@pichuan
Collaborator

pichuan commented May 25, 2022

@GuillaumeHolley FYI, I'm working on updating the tutorial. I will add this sentence:

NOTE: If you prefer shuffling locally, please take a look at this user-provided
shuffler option: #360 (comment)

If you want to suggest a different sentence in the tutorial, please let me know!

@yinshiyi
Contributor

@pichuan, when shuffling the datasets using the local runner, if direct_num_workers is set to 0, it will use all the local CPUs.
I got this warning that made me think:

WARNING:apache_beam.runners.portability.fn_api_runner.fn_runner:If direct_num_workers is not equal to 1, direct_running_mode should be `multi_processing` or `multi_threading` instead of `in_memory` in order for it to have the desired worker parallelism effect. direct_num_workers: 8 ; running_mode: in_memory

Is there a reason we use in_memory rather than other modes?
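
For reference, the modes the warning refers to are plain Beam DirectRunner pipeline options. A minimal sketch of what the warning seems to suggest (a standalone toy pipeline, not the exact shuffle command) would be:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Minimal sketch of the DirectRunner options the warning mentions. The same
# flags can also be passed on the command line of a Beam pipeline script.
options = PipelineOptions([
    "--runner=DirectRunner",
    "--direct_num_workers=8",
    "--direct_running_mode=multi_processing",  # instead of the default in_memory
])
with beam.Pipeline(options=options) as pipeline:
    _ = (pipeline
         | beam.Create(list(range(100)))
         | beam.Reshuffle()
         | beam.Map(print))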

Thank you.

@pichuan
Collaborator

pichuan commented Oct 17, 2024

Hi @yinshiyi, hello :D

First, this is a pretty old issue, so it might be easier to open a new one; otherwise, our team member on rotation might not notice it.

To your question, are you asking about https://github.com/google/deepvariant/blob/r1.6.1/docs/deepvariant-training-case-study.md ?

In that documentation, we provided two examples of using the shuffle script. One is with:

  --runner=DirectRunner \

the other one is with:

  --runner=DataflowRunner \

If you intend to use Dataflow, please refer to the command that uses DataflowRunner.

@yinshiyi, if you want to discuss further, please open a new issue with a few more details on which command you were using. That will help our team member provide more support. Thank you :)
