This is an explanation of how to launch ConnectedReads in Google Cloud.
Google Cloud Dataproc (Cloud Dataproc) is a cloud-based managed Spark and Hadoop service offered on Google Cloud Platform.
To access Dataproc, please install the Google Cloud SDK (which includes gsutil)
first. See the Google Cloud documentation for an installation guide.
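One common way to install the Google Cloud SDK is via its interactive installer script (a sketch; see the official installation guide for platform-specific instructions and package-manager alternatives):

```shell
# Download and run the Google Cloud SDK installer, then restart the shell
# so that gcloud and gsutil are on PATH.
curl https://sdk.cloud.google.com | bash
exec -l $SHELL

# Initialize the SDK: authenticate and choose a default project/zone.
gcloud init
```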
For password-less deployment, your SSH key is required. Please refer to the Google Cloud documentation for instructions on acquiring your SSH key.
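If you do not yet have a key, one way to generate it is with ssh-keygen (a sketch; ~/.ssh/google_compute_engine is the default key path that gcloud uses, so adjust if your setup differs):

```shell
# Generate an RSA key pair at the default location gcloud looks for.
# -N "" creates the key without a passphrase; set one if you prefer.
ssh-keygen -t rsa -f ~/.ssh/google_compute_engine -C "$USER" -N ""

# Alternatively, let gcloud create and register the key automatically:
# gcloud compute config-ssh
```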
gcloud beta dataproc clusters create my-connectedreads \
--subnet default --zone us-west1-b \
--num-workers 5 --worker-machine-type n1-highmem-32 \
--image-version 1.2.59-deb9 \
--initialization-actions gs://seqslab-deepvariant/scripts/initialization-on-dataproc.sh \
--initialization-action-timeout 20m
When you have finished, delete the cluster to avoid further charges:
gcloud beta dataproc clusters delete my-connectedreads
For password-less deployment, your SSH key
(i.e. ~/.ssh/google_compute_engine) should first be added by using ssh-add.
Once the cluster has launched completely, you can log in to the
master's terminal via the Google Cloud Platform console or the following
commands:
ssh-add -K ~/.ssh/google_compute_engine
gcloud compute ssh --ssh-flag="-A" my-connectedreads-m --zone="us-west1-b"
Note: if ssh-add
fails with an error message like "Error
connecting to agent: No such file or directory", please run the
following command first.
ssh-agent bash
ConnectedReads uses mvn (Apache Maven) to build its packages. Please clone the ConnectedReads GitHub repository and build all of its modules with the following commands:
git clone https://github.com/atgenomix/connectedreads.git
cd connectedreads
mvn package
Please upload your FASTQ files to the Spark master and edit scripts/run.sh accordingly. Please refer to the ConnectedReads WGS case study for more details.
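The upload step can be sketched as follows (the bucket name and file names below are placeholders, not part of the ConnectedReads distribution; run these on the cluster master):

```shell
# Copy FASTQ files from a Cloud Storage bucket to the master node.
# gs://my-bucket/sample.fastq.gz is a placeholder path.
gsutil cp gs://my-bucket/sample.fastq.gz .

# Put the files into HDFS so the Spark job can read them.
hdfs dfs -mkdir -p /data
hdfs dfs -put sample.fastq.gz /data/
```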
Please refer to the Cluster Operation Portal to monitor the health status of YARN and HDFS.