-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-RNAseq_bulk.Rmd
259 lines (203 loc) · 7.95 KB
/
09-RNAseq_bulk.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
# Bulk RNAseq
This pipeline covers RNA-seq reads quality summary by `fastQC`,
alignment by `STAR`, quantification by `featureCounts` and quality
control by `RSeQC`.
```{r, warning=FALSE}
rnaseq_Sf <- cwlLoad("pl_rnaseq_Sf")
plotCWL(rnaseq_Sf)
```
The pipeline includes 10 steps, each step runs a single command as
follows:
```{r}
runs(rnaseq_Sf)
```
* `fastqc`: to run quality summary for raw fastqs with base command `fastqc`.
* `STAR`: to align fastqs with `STAR`.
* `sortBam`: to sort bam files with `samtools`.
* `samtools_index`: to index aligned bam file with `samtools`.
* `samtools_flagstat`: to summarize alignment flags with `samtools`.
* `featureCounts`: to quantify gene abundance with `featureCounts`.
* `gtfToGenePred`: to convert GTF annotation to 'genePred' format with `RSeQC`.
* `genePredToBed`: to convert 'genePred' annotation to 'bed' format with `RSeQC`.
* `r_distribution`: to run read distribution over genome features with `RSeQC`.
* `gCoverage`: to summarize read coverage over gene body with `RSeQC`.
The `rnaseq_Sf` pipepine output includes the QC result from `fastqc`
step, indexed bam files from `samtools_index` step, log and read
counts from `STAR` step, flag summary from `samtools_flagstat` step,
feature counts from `featureCounts` step, alignment QC results from
`RSeQC` steps.
```{r}
outputs(rnaseq_Sf)
```
## Prepare data
An RNASeq test data with paired-end fastq files for 6 samples can be
downloaded from [genomedata](http://genomedata.org/rnaseq-tutorial).
Create a local directory and follow the code below to download and
uncompress.
```{r, eval=FALSE}
dir.create("data/RNAseq", recursive = TRUE)
download.file("http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar",
"data/RNAseq/HBR_UHR_ERCC_ds_5pc.tar")
untar("data/RNAseq/HBR_UHR_ERCC_ds_5pc.tar",
exdir = "data/RNAseq/")
```
## Submit parallel jobs
Powered by `BiocParallel`,`Rcwl` supports parallel job running for
multiple samples using the `runCWLBatch` function.
The `BPPARAM` argument in `runCWLBatch()` defines the parallel
parameters. It can be defined by `BiocParallel::BatchtoolsParam`
function, where the `cluster` argument takes different values for
different cluster job manager, such as "multicore", "sge" and
"slurm". More details about available options can be checked by
`?BiocParallel::BatchtoolsParam`.
```{r}
library(BiocParallel)
```
```{r, eval=FALSE}
bpparam <- BatchtoolsParam(workers = 2, cluster = "sge",
template = batchtoolsTemplate("sge"))
```
In the following example, we are using "multicore" for the parallel
running.
```{r}
bpparam <- BatchtoolsParam(
workers = 2, cluster = "multicore")
```
When submitting parallel jobs using `runCWLBatch` function, two other
arguments: `inputList` and `paramList`, need to be defined.
The `inputList` argument is required to be a list of input parameter
values for samples that are to be computed parallelly. **NOTE** that
the names of the list must be consistent with the ids of input
parameters. In this example, they are:
* `in_seqfiles`: A list with the fastq files of each sample in each
element. The names of the list need to be defined and can be the
sample IDs. The length of the list will be the same as the number of
samples, so that the list of samples can be assigned to different
nodes for parallel computing. Here we only use 2 samples for
demonstration purposes.
* `in_prefix`: A list of sample IDs.
```{r}
files <- normalizePath(list.files("data/RNAseq/", ".gz",
full.names = TRUE))[1:4]
files <- tapply(files, substring(basename(files), 1, 8), as.list)
inputList <- list(in_seqfiles = files,
in_prefix = as.list(names(files)))
```
The `paramList` argument is required to be a list of input parameter
values that are to be shared for all parallelly running samples. In
this example, they are:
* `in_genomeDir`: The reference genome indexes for STAR.
* `in_GTFfile`: The gene annotation file in GTF format.
* `in_runThreadN`: The number of threads to run for each job.
```{r}
paramList <- list(
in_genomeDir = "data/resources/GRCh38_chr22",
in_GTFfile = "data/resources/GRCh38_chr22/gencode.v32.annotation_chr22.gtf",
in_runThreadN = 2
)
```
Here we can also modify the default argument values in some steps of a
pipeline. For example,
```{r}
arguments(rnaseq_Sf, "STAR")[1:2]
arguments(rnaseq_Sf, "STAR")[[2]] <- "2"
arguments(rnaseq_Sf, "STAR")[1:2]
```
Now that the fastqc files of each sample will be submitted to
different nodes to run the whole pipeline parallelly.
```{r, eval=FALSE}
res <- runCWLBatch(cwl = rnaseq_Sf,
outdir = "output/RNAseq_bulk",
inputList = inputList,
paramList = paramList,
BPPARAM = bpparam,
showLog = TRUE)
```
Pipeline results are collected in the output directory (defined in
`outdir`) for each sample.
```{r}
dir("output/RNAseq_bulk")
```
## QC Summary
The tool `multiqc` can aggregate results from the multiple outputs of
the pipeline and generate a single page report, which also was
implemented in the `RcwlPipelines` package:
```{r, eval=FALSE}
multiqc$dir <- "output/RNASeq_bulk"
multiqc
```
We can also run the tool using `Rcwl` locally with the option `docker
= TRUE`:
```{r, eval=FALSE}
runCWL(multiqc, stderr = "", Args = "--preserve-entire-environment", docker = FALSE)
```
## Abundances summary
Here we use the _R/Bioconductor_ package `edgeR` functions to
calculate the RPKM and CPM abundances.
```{r}
countfiles <- list.files("output/RNAseq_bulk", "featureCounts.txt$",
recursive = TRUE, full.names = TRUE)
samples <- basename(dirname(countfiles))
rExp <- function(countfile){
count1 <- read.table(countfile, header = TRUE)[, c(1,6,7)]
rpkm1 <- edgeR::rpkm(count1[,3,drop=F], gene.length = count1$Length)
cpm1 <- edgeR::cpm(count1[,3])
count1 <- data.frame(count1, rpkm1, cpm1)
colnames(count1)[3:5] <- c("count", "rpkm", "cpm")
return(count1)
}
```
```{r}
head(rExp(countfiles[1]))
```
We combine the files into one file, and then the data is ready for
statistical analysis using _R/Bioconductor_ packages, such as `DESeq2`
or `edgeR`.
```{r, eval=FALSE}
for(i in 1:length(samples)) {
exp1 <- rExp(countfiles[i])
write.table(exp1, file = paste0("output/RNAseq_bulk", samples[i],
"/", samples[i], "_abundance.tsv"),
row.names = FALSE, quote = FALSE, sep = "\t")
}
```
## transcriptome quantification
There are many tools available for transcriptome quantification, such
as kallisto, StringTie, salmon, and Trinity. Here show the usage of
`Kallisto` and `salmon`.
### Kallisto
The kallisto is a tool to quantify transcript abundance with raw fastq
reads and indexed reference transcriptomes. The `Rcwl` tool of
`kallisto_index` below is used to build an index file from reference
transcriptome fasta.
```{r}
kallisto_index <- cwlLoad("tl_kallisto_index")
```
```{r, eval=FALSE}
inputs(kallisto_index)
kallisto_index$fasta <- "data/resources/gencode.v33.transcripts.fa"
kallisto_index$index <- "gencode.v33.transcripts_kallisto"
runCWL(kallisto_index, outdir = "data/resources/kallisto", showLog = TRUE)
```
The `Rcwl` tool `kallisto_quant` runs the quantification algorithm to
estimate the transcripts expression abundances.
```{r}
kallisto_quant <- cwlLoad("tl_kallisto_quant")
```
```{r, eval=FALSE}
inputList <- list(fastq = files)
paramList <- list(index = "data/resources/gencode.v33.transcripts_kallisto",
threads = 16)
bpparam <- BatchtoolsParam(workers = length(samples),
cluster = "multicore")
res2 <- runCWLBatch(kallisto_quant, outdir = "output/RNAseq_bulk/kallisto",
inputList, paramList,
BPPARAM = bpparam,
log = TRUE, logdir = ".")
```
Then the tool results are collected here:
```{r}
list.files("output/RNAseq_bulk/kalisto", "abundance")
```
### salmon
To be added.