Merge pull request #435 from grst/add-umi-tools
Add UMI tools
grst authored Jul 27, 2020
2 parents e14e0d4 + 8c3b6ae commit 4a0eeb4
Showing 12 changed files with 326 additions and 49 deletions.
10 changes: 9 additions & 1 deletion .github/workflows/ci.yml
@@ -23,7 +23,15 @@ jobs:
# Nextflow versions: check pipeline minimum and current latest
nxf_ver: ['19.10.0', '']
aligner: ["--aligner 'hisat2'", "--aligner 'star'", "--pseudo_aligner 'salmon'"]
options: ['--skipQC', '--remove_rRNA', '--saveUnaligned', '--skipTrimming', '--star_index false']
options:
- '--skipQC'
- '--remove_rRNA'
- '--saveUnaligned'
- '--skipTrimming'
- '--star_index false'
- '--skip_rsem'
- '--with_umi'
- '--with_umi --skipTrimming --save_umi_intermediates --skip_rsem'
steps:
- name: Check out pipeline code
uses: actions/checkout@v2
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -20,6 +20,10 @@
* Add option for `--additional_fasta` to provide ERCC spike-ins, transgenes such as GFP or CAR-T as additional sequences to align to [#419](https://github.com/nf-core/rnaseq/pull/419)
* Updates awstest GitHub actions workflow with organization level secrets [#431](https://github.com/nf-core/rnaseq/pull/431/files)
* Fix a bug where the RSEM reference could not be built [#436](https://github.com/nf-core/rnaseq/pull/436)
* Implement UMI-based read deduplication [#435](https://github.com/nf-core/rnaseq/pull/435)
* Fix a bug where gzipped references were not extracted when `--additional_fasta` was not specified [#435](https://github.com/nf-core/rnaseq/pull/435)
* Fix a bug where merging of RSEM output would fail if only one FastQ file was provided as input [#435](https://github.com/nf-core/rnaseq/pull/435)
* Correct RSEM output name (was saving counts but calling them TPMs; now saving both properly labelled) [#435](https://github.com/nf-core/rnaseq/pull/435)
* Fix typo reported for work-dir [#434](https://github.com/nf-core/rnaseq/issues/434)

#### Updated Packages
@@ -32,6 +36,7 @@

* Added `pigz` `2.3.4` for parallelized trim-galore support
* Added `rsem` `1.3.3` for gene/transcript quantification
* Added `umi_tools` `1.0.1` for UMI extraction and deduplication

## [Version 1.4.2](https://github.com/nf-core/rnaseq/releases/tag/1.4.2) - 2019-10-18

1 change: 1 addition & 0 deletions README.md
@@ -32,6 +32,7 @@ generates counts relative to genes
[dupRadar](https://bioconductor.org/packages/release/bioc/html/dupRadar.html),
[Preseq](http://smithlabresearch.org/software/preseq/),
[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html),
[umi_tools](https://github.com/CGATOxford/UMI-tools),
[MultiQC](http://multiqc.info/)). See the [output documentation](docs/output.md) for more details of the results.

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.
7 changes: 7 additions & 0 deletions assets/where_are_my_files.txt
@@ -21,6 +21,12 @@ Specify to save trimmed FastQ files to the results directory.
Save any downloaded or generated reference genome files to your results folder.
These can then be used for future pipeline runs, reducing processing times.

`--save_umi_intermediates`
UMI extraction and deduplication generate intermediate FastQ files with the UMIs
removed from the read and added to the FastQ header, and deduplicated BAM files.
Enabling this option saves these files to the `umitools` folder in the
results directory.

-----------------------------------
Setting defaults in a config file
-----------------------------------
@@ -31,6 +37,7 @@ the command line, you can save the following to your personal configuration file
params.saveReference = true
params.saveTrimmed = true
params.saveAlignedIntermediates = true
params.save_umi_intermediates = true

For more help, see the following documentation:

1 change: 1 addition & 0 deletions bin/scrape_software_versions.py
@@ -25,6 +25,7 @@
'dupRadar': ['v_dupRadar.txt', r"(\S+)"],
'edgeR': ['v_edgeR.txt', r"(\S+)"],
'MultiQC': ['v_multiqc.txt', r"multiqc, version (\S+)"],
'umi_tools': ['v_umi_tools.txt', r"UMI-tools version: (\S+)"]
}
results = OrderedDict()
results['nf-core/rnaseq'] = '<span style="color:#999999;">N/A</span>'
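
As a rough illustration of what the new `umi_tools` entry does, the sketch below applies the same regular expression to a made-up version string. The example file content and the use of `re.search` are assumptions for demonstration purposes and may differ from how the script actually reads `v_umi_tools.txt`.

```python
import re

# Regex copied from the `umi_tools` entry above; the text it is applied to
# is a made-up example of what v_umi_tools.txt might contain.
pattern = r"UMI-tools version: (\S+)"
example_output = "UMI-tools version: 1.0.1\n"

match = re.search(pattern, example_output)
# Fall back to "N/A" when the pattern is not found (an assumed convention).
version = match.group(1) if match else "N/A"
print(version)  # 1.0.1
```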
1 change: 1 addition & 0 deletions conf/test.config
@@ -29,4 +29,5 @@ params {
gff = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gff'
transcript_fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/transcriptome.fasta'
additional_fasta = "https://github.com/nf-core/test-datasets/raw/rnaseq/reference/gfp.fa"
umitools_bc_pattern = "NNNN"
}
1 change: 1 addition & 0 deletions conf/test_gz.config
@@ -32,4 +32,5 @@ params {
star_index = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/star.tar.gz'
salmon_index = 'https://github.com/grst/test-datasets/raw/rnaseq/reference/salmon_index.tar.gz'
compressedReference = true
umitools_bc_pattern = "NNNN"
}
35 changes: 35 additions & 0 deletions docs/output.md
@@ -13,6 +13,7 @@ and processes data using the following steps:
- [TrimGalore](#trimgalore)
- [SortMeRNA](#sortmerna)
- [STAR](#star)
- [UMI tools](#umi-tools)
- [RSeQC](#rseqc)
- [BAM stat](#bam-stat)
- [Infer experiment](#infer-experiment)
@@ -26,6 +27,7 @@ and processes data using the following steps:
- [dupRadar](#dupradar)
- [Preseq](#preseq)
- [featureCounts](#featurecounts)
- [RSEM](#rsem)
- [Salmon](#salmon)
- [tximport](#tximport)
- [Index files](#index-files)
@@ -106,6 +108,35 @@ The STAR section of the MultiQC report shows a bar plot with alignment rates: go
- `unaligned/...`
- Contains the unmapped reads that couldn't be mapped against the reference genome chosen. This is only available when the user specifically asks for `--saveUnaligned` output.

## UMI tools

UMI-tools deduplicates reads based on unique molecular identifiers (UMIs) to
address PCR bias. UMI-tools performs two independent steps:

1. Remove the UMIs from the reads in the raw FastQ files and add them to the read name (`extract`).
2. Deduplicate the aligned reads in the BAM files based on the UMIs (`dedup`).
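
The two steps can be illustrated with a minimal Python sketch. This is not the UMI-tools implementation: it assumes the UMI is the first few bases of the read and is appended to the read name with an underscore, and it collapses only exact (chromosome, position, UMI) matches, whereas UMI-tools by default also accounts for sequencing errors in the UMIs. All function names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical helper names; a simplified illustration, not the UMI-tools code.


def extract_umi(name: str, seq: str, qual: str, umi_len: int) -> Tuple[str, str, str]:
    """Move the first `umi_len` bases of a read into its name (cf. `extract`)."""
    umi = seq[:umi_len]
    read_id, _, comment = name.partition(" ")
    new_name = f"{read_id}_{umi}" + (f" {comment}" if comment else "")
    # Trim the UMI bases and their qualities from the read itself.
    return new_name, seq[umi_len:], qual[umi_len:]


def dedup(alignments: Iterable[Tuple[str, str, int]]) -> List[Tuple[str, str, int]]:
    """Keep the first alignment per (chromosome, position, UMI) (cf. `dedup`).

    Each alignment is (read_name, chromosome, position); the UMI is read
    back from the suffix after the last underscore in the read identifier.
    """
    seen = set()
    kept = []
    for name, chrom, pos in alignments:
        umi = name.split(" ")[0].rsplit("_", 1)[-1]
        key = (chrom, pos, umi)
        if key not in seen:
            seen.add(key)
            kept.append((name, chrom, pos))
    return kept


name, seq, qual = extract_umi("read1 1:N:0", "ACGTTTGGCC", "IIIIIIIIII", 4)
print(name, seq, qual)  # read1_ACGT 1:N:0 TTGGCC IIIIII
```

Two reads mapped to the same position with the same UMI are treated as PCR duplicates in this sketch, while a read at the same position with a different UMI is kept as an independent molecule.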

**Output directory: `results/umitools/extract`**

For each sample, this directory contains a log file.
If `--save_umi_intermediates` is specified, this directory additionally contains the
FastQ files with the extracted UMI.

**Output directory: `results/umitools/dedup`**

For each sample, this directory contains summary statistics on the UMI duplication:

```console
<SAMPLE>_edit_distance.tsv
<SAMPLE>_per_umi_per_position.tsv
<SAMPLE>_per_umi.tsv
```

The content of these files is explained [in the UMI tools documentation](https://umi-tools.readthedocs.io/en/latest/reference/dedup.html#dedup-specific-options).

If `--save_umi_intermediates` is specified, this directory additionally contains
the deduplicated BAM file and its associated index.

## RSeQC

RSeQC is a package of scripts designed to evaluate the quality of RNA seq data. You can find out more about the package at the [RSeQC website](http://rseqc.sourceforge.net/).
@@ -342,6 +373,10 @@ We also use featureCounts to count overlaps with different classes of features.
- Main gene-level TPM matrix including all genes
- `rsem_tpm_isoform.txt`
- Main isoform-level TPM matrix
- `rsem_transcript_counts_gene.txt`
- Main gene-level count matrix including all genes
- `rsem_transcript_counts_isoform.txt`
- Main isoform-level count matrix
- `*.stat`
- RSEM statistics for each sample
- `*.genes.results`
38 changes: 38 additions & 0 deletions docs/usage.md
@@ -43,6 +43,13 @@
- [`--three_prime_clip_r2 [int]`](#threeprimeclipr2-int)
- [`--trim_nextseq [int]`](#trimnextseq-int)
- [`--skipTrimming`](#skiptrimming)
- [UMI handling](#umi-handling)
- [`--with_umi`](#withumi)
- [`--umitools_extract_method [str]`](#umitoolsextractmethod-str)
- [`--umitools_bc_pattern [str]`](#umitoolsbcpattern-str)
- [`--umitools_extract_extra [str]`](#umitoolsextractextra-str)
- [`--umitools_dedup_extra [str]`](#umitoolsdedupextra-str)
- [`--save_umi_intermediates`](#saveumiintermediates)
- [Ribosomal RNA removal](#ribosomal-rna-removal)
- [`--removeRiboRNA`](#removeriborna)
- [`--saveNonRiboRNAReads`](#savenonribornareads)
@@ -453,6 +460,37 @@ This enables the option --nextseq-trim=3'CUTOFF within Cutadapt in Trim Galore,

This skips the trimming process, which saves time when re-analyzing data that has already been trimmed.

## UMI handling

### `--with_umi`

Enable UMI-based read deduplication.

### `--umitools_extract_method [str]`

UMI extraction method to use. Can be either `string` (default) or `regex`.
More details are in the [UMI-tools documentation](https://umi-tools.readthedocs.io/en/latest/reference/extract.html#extract-method).

### `--umitools_bc_pattern [str]`

The UMI barcode pattern to use. See the [UMI-tools documentation](https://umi-tools.readthedocs.io/en/latest/reference/extract.html#extract-method)
for more details.

Example: `NNNNNN` means that the first 6 nucleotides of the read are the UMI.
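
To make the pattern semantics concrete, here is a minimal Python sketch of how such a pattern could be applied to a read. It assumes the `string` extraction method, where `N` marks a UMI position and `X` marks a base that stays in the read; the function name is hypothetical, other pattern characters are deliberately not handled, and this is not the UMI-tools implementation.

```python
from typing import Tuple


def apply_bc_pattern(seq: str, pattern: str) -> Tuple[str, str]:
    """Split `seq` into (umi, remaining_read) using a `string`-method pattern.

    Assumption for this sketch: `N` marks a UMI base and `X` marks a base
    that stays in the read; other pattern characters are not handled.
    """
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    umi, kept = [], []
    for base, symbol in zip(seq, pattern):
        if symbol == "N":
            umi.append(base)
        elif symbol == "X":
            kept.append(base)
        else:
            raise ValueError(f"unsupported pattern character: {symbol!r}")
    # Bases beyond the end of the pattern always remain part of the read.
    return "".join(umi), "".join(kept) + seq[len(pattern):]


print(apply_bc_pattern("ACGTACGGTTAA", "NNNNNN"))  # ('ACGTAC', 'GGTTAA')
```

With the test configuration's pattern `NNNN`, the same function would take only the first four bases as the UMI.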

### `--umitools_extract_extra [str]`

Extra arguments that are passed verbatim to the `umi_tools extract` step.

### `--umitools_dedup_extra [str]`

Extra arguments that are passed verbatim to the `umi_tools dedup` step.

### `--save_umi_intermediates`

If this option is specified, the intermediate FastQ and BAM files produced
by UMI-tools are saved to the output directory.

## Ribosomal RNA removal

If rRNA removal is desired (for example, metatranscriptomics),
1 change: 1 addition & 0 deletions environment.yml
@@ -38,3 +38,4 @@ dependencies:
- trim-galore=0.6.4
- pigz=2.3.4
- rsem=1.3.3
- umi_tools=1.0.1