-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path06-RcwlPipelines.Rmd
210 lines (162 loc) · 6.05 KB
/
06-RcwlPipelines.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
# RcwlPipelines
## Rcwl recipes and CWL scripts
The _R_ scripts to build the CWL tools and pipelines are now residing
in a dedicated [GitHub
repository](https://github.com/rworkflow/RcwlRecipes), which is
intended to be a community effort to collect and contribute
Bioinformatics tools and pipelines using `Rcwl` and CWL. Script names
are prefixed with `tl_` and `pl_` for tools and pipelines
respectively.
## RcwlPipelines core functions
### `cwlUpdate`
The `cwlUpdate` function syncs the current `Rcwl` recipes and returns
a `cwlHub` object which contains the most updated `Rcwl` recipes. The
`mcols()` function returns all related information about each
available tool or pipeline. Currently, we have integrated 113 command
line tools and 26 pipelines.
The recipes will be locally cached, so users don't need to call
`cwlUpdate` every time unless they want to use a tool/pipeline that is
newly added to `RcwlPipelines`.
```{r, message=FALSE}
atls <- cwlUpdate(branch = "dev") ## sync the tools/pipelines.
atls
```
Currently, we have integrated `r sum(Type(atls)=="tool")` command
line tools and `r sum(Type(atls)=="pipeline")` pipelines.
```
table(mcols(atls)$Type)
```
We can also get the commands and docker containers for specific tool or pipeline.
```{r}
mcols(atls)[, c("Command", "Container")]
```
### `cwlSearch`
We can use (multiple) keywords to search for specific tools/pipelines
of interest, which internally search the `mcols` of "rname", "rpath",
"fpath", "Command" and "Containers". Here we show how to search the
alignment tool `bwa mem`.
```{r}
t1 <- cwlSearch(c("bwa", "mem"))
t1
mcols(t1)
```
### `cwlLoad`
The last core function `cwlLoad` loads the `Rcwl` tool/pipeline into
the _R_ working environment. The code below loads the tool with a
user-defined name `bwa` to do the read alignment.
```{r}
bwa <- cwlLoad(title(t1)[1]) ## "tl_bwa"
bwa <- cwlLoad(mcols(t1)$fpath[1]) ## equivalent to the above.
bwa
```
Now the _R_ tool of `bwa` is ready to use.
## Tool/pipeline customization
To fit users' specific needs,the existing tool or pipline can be
easily customized. Here we use the `rnaseq_Sf` pipeline to demonstrate
how to access and change the arguments of a specific tool inside a
pipeline. This pipeline covers RNA-seq reads quality summary by
`fastQC`, alignment by `STAR`, quantification by `featureCounts` and
quality control by `RSeQC`.
```{r, warning=FALSE}
rnaseq_Sf <- cwlLoad("pl_rnaseq_Sf")
plotCWL(rnaseq_Sf)
```
There are many default arguments defined for the tool of `STAR` inside
the pipeline. Users might want to change some of them. For example, we
can change the value for `--outFilterMismatchNmax` argument from 2 to
5 for longer reads.
```{r}
arguments(rnaseq_Sf, "STAR")[5:6]
arguments(rnaseq_Sf, "STAR")[[6]] <- 5
arguments(rnaseq_Sf, "STAR")[5:6]
```
We can also change the docker image for a specific tool (e.g., to a
specific version). First, we search for all available docker images
for `STAR` in biocontainers repository. The Source server could be
[quay](https://quay.io/) or [dockerhub](https://hub.docker.com).
```{r}
searchContainer("STAR", repo = "biocontainers", source = "quay")
```
Then, we can change the `STAR` version into 2.7.8a (tag name: 2.7.8a--0).
```{r}
requirements(rnaseq_Sf, "STAR")[[1]]
requirements(rnaseq_Sf, "STAR")[[1]] <- requireDocker(
docker = "quay.io/biocontainers/star:2.7.8a--0")
requirements(rnaseq_Sf, "STAR")[[1]]
```
## Build a pipeline
We can build a pipline using the available tools. Here we demonstrate
how to build a simple alignment pipeline with mapping and marking
duplicates.
First, we check whether the required tools (bwa, samtools and picard
markduplicates) are available in `RcwlPipelines`.
```{r}
tls <- cwlSearch("bwa|sam2bam|sortBam|samtools_index|markdup",
type = "tool")
tls
```
Then we load all the required tools.
```{r}
bwa <- cwlLoad("tl_bwa")
bwa_index <- cwlLoad("tl_bwa_index")
markdup <- cwlLoad("tl_markdup")
sam2bam <- cwlLoad("tl_sam2bam")
samtools_index <- cwlLoad("tl_samtools_index")
sortBam <- cwlLoad("tl_sortBam")
```
Next, we will need to define the input parameters for the pipeline
(instead of for each tool).
```{r}
p1 <- InputParam(id = "threads", type = "int")
p2 <- InputParam(id = "RG", type = "string")
p3 <- InputParam(id = "Ref", type = "string")
p4 <- InputParam(id = "FQ1", type = "File")
p5 <- InputParam(id = "FQ2", type = "File?")
```
Then we define the pipeline steps, to connect the inputs and outputs
of each tool to form a pipeline.
```{r}
## bwa
s1 <- cwlStep(id = "bwa", run = bwa,
In = list(threads = "threads",
RG = "RG",
Ref = "Ref",
FQ1 = "FQ1",
FQ2 = "FQ2"))
## sam to bam
s2 <- cwlStep(id = "sam2bam", run = sam2bam,
In = list(sam = "bwa/sam"))
## sort bam
s3 <- cwlStep(id = "sortBam", run = sortBam,
In = list(bam = "sam2bam/bam"))
## mark duplicates
s4 <- cwlStep(id = "markdup", run = markdup,
In = list(ibam = "sortBam/sbam",
obam = list(
valueFrom="$(inputs.ibam.nameroot).mdup.bam"),
matrix = list(
valueFrom="$(inputs.ibam.nameroot).markdup.txt")))
## index bam
s5 <- cwlStep(id = "idxBam", run = samtools_index,
In = list(bam = "markdup/mBam"))
```
Last, we will define the pipeline outputs and connect all the above
defined steps into a new pipeline.
```{r}
req1 <- requireStepInputExpression()
req2 <- requireJS()
## outputs
o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam")
o2 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx")
## cwlWorkflow
Align <- cwlWorkflow(requirements = list(req1, req2),
inputs = InputParamList(p1, p2, p3, p4, p5),
outputs = OutputParamList(o1, o2))
## build pipeline
Align <- Align + s1 + s2 + s3 + s4 + s5
```
Now the pipeline is successfully built and ready for use. We can
visualize the pipeline with `plotCWL` from the `Rcwl` package.
```{r}
plotCWL(Align)
```