ConnectedReads Quick Start on Google Cloud

This guide explains how to launch ConnectedReads on Google Cloud.

Background

Google Cloud Dataproc (Cloud Dataproc) is a cloud-based managed Spark and Hadoop service offered on Google Cloud Platform.

Preliminaries

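To access Dataproc, please install gsutil first; it ships with the Google Cloud SDK, which also provides the gcloud command used throughout this guide. Refer to Google Cloud for the installation guide.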
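Before continuing, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch for a POSIX shell; the helper name `missing_tools` is hypothetical and is not part of the Google Cloud SDK.

```shell
# Print the name of every tool in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of the Google Cloud SDK.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: prints nothing when both tools are installed.
# missing_tools gcloud gsutil
```

If the function prints a tool name, that tool still needs to be installed or added to PATH.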

Password-less deployment requires an SSH key. Please refer to this link to obtain your SSH key.

Launch Cluster

gcloud beta dataproc clusters create my-connectedreads \
 --subnet default --zone us-west1-b \
 --num-workers 5 --worker-machine-type n1-highmem-32 \
 --image-version 1.2.59-deb9 \
 --initialization-actions gs://seqslab-deepvariant/scripts/initialization-on-dataproc.sh \
 --initialization-action-timeout 20m
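Cluster creation, including the initialization action, can take several minutes. One way to check whether the cluster is ready is to query its state with `gcloud dataproc clusters describe`. The sketch below wraps that call in a hypothetical helper named `cluster_state`; it assumes the cluster name used above, and newer SDK releases may additionally require a `--region` argument, which the helper passes through.

```shell
# Print the current state of a Dataproc cluster (for example RUNNING).
# cluster_state is a hypothetical helper; extra arguments such as
# --region are passed through to gcloud unchanged.
cluster_state() {
    name=$1
    shift
    gcloud dataproc clusters describe "$name" \
        --format='value(status.state)' "$@"
}

# Example:
# cluster_state my-connectedreads
```

Wait until the reported state is RUNNING before logging in to the master.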

Delete Cluster

gcloud beta dataproc clusters delete my-connectedreads

ConnectedReads Installation

Log in to the Spark master

For password-less deployment, first add your SSH key (i.e., ~/.ssh/google_compute_engine) with ssh-add. Once the cluster has finished launching, you can log in to the master's terminal via Google Cloud Platform or with the following commands. Note that the -K flag shown here is the macOS keychain option; on other systems it may have a different meaning and can usually be omitted.

ssh-add -K ~/.ssh/google_compute_engine
gcloud compute ssh --ssh-flag="-A" my-connectedreads-m --zone="us-west1-b"

Note: if ssh-add fails with an error message such as "Error connecting to agent: No such file or directory", run the following command first to start a shell with an SSH agent, then run ssh-add again.

ssh-agent bash
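The command above opens a new shell that has an agent attached. As an alternative, an agent can be started inside the current shell only when none is reachable. The sketch below relies on the documented behaviour that `ssh-add` exits with status 2 when it cannot contact an agent; the helper name `ensure_ssh_agent` is hypothetical.

```shell
# Start an ssh-agent in the current shell only when none is reachable.
# ensure_ssh_agent is a hypothetical helper name.
ensure_ssh_agent() {
    agent_status=0
    # ssh-add exits with status 2 when it cannot contact an agent
    ssh-add -l >/dev/null 2>&1 || agent_status=$?
    if [ "$agent_status" -eq 2 ]; then
        # ssh-agent -s prints Bourne-shell commands that export
        # SSH_AUTH_SOCK and SSH_AGENT_PID; eval applies them here
        eval "$(ssh-agent -s)" >/dev/null
    fi
}

# Example:
# ensure_ssh_agent
# ssh-add ~/.ssh/google_compute_engine
```

Because the agent variables are exported into the current shell, the subsequent `gcloud compute ssh --ssh-flag="-A"` call can forward the key without opening a nested shell.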

Download and Build ConnectedReads

ConnectedReads is built with Maven (mvn). Clone the ConnectedReads GitHub repository and build all modules with the following commands:

git clone https://github.com/atgenomix/connectedreads.git
cd connectedreads
mvn package
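With Maven's default layout, each module writes its artifact to its own target/ directory. To confirm that the build produced its JAR files, something like the following sketch can be used. It assumes the default Maven output layout, since the module names are not listed in this guide, and the helper name `list_built_jars` is hypothetical.

```shell
# List JAR files under any target/ directory below the given root
# (default: current directory). list_built_jars is a hypothetical helper
# and assumes Maven's default output layout.
list_built_jars() {
    find "${1:-.}" -type f -path '*/target/*.jar' | sort
}

# Example, run from the connectedreads checkout after mvn package:
# list_built_jars .
```

An empty result suggests that the build did not complete or that the project uses a non-default output directory.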

Run ConnectedReads

Please upload your FASTQ files to the Spark master and edit scripts/run.sh accordingly. Refer to the ConnectedReads WGS case study for more details.
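One way to perform the upload from your local machine is `gcloud compute scp`. The sketch below assumes the master instance name and zone used earlier in this guide; the helper name `upload_to_master` and the file names in the example are placeholders, not names taken from the ConnectedReads repository.

```shell
# Copy one or more local files to the home directory of the master node.
# upload_to_master is a hypothetical helper; the instance name and zone
# match the cluster created earlier in this guide.
upload_to_master() {
    gcloud compute scp "$@" 'my-connectedreads-m:~/' --zone us-west1-b
}

# Example with placeholder file names:
# upload_to_master sample_R1.fastq.gz sample_R2.fastq.gz
```

The files land in your home directory on the master; adjust the paths in scripts/run.sh to match wherever you place them.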

Status Monitor

Please refer to the Cluster Operation Portal to monitor the health of YARN and HDFS.