Merge pull request #435 from grst/add-umi-tools

Add UMI tools
nf-core · Jul 27, 2020 · 4a0eeb4
2 parents e14e0d4 + 8c3b6ae
Showing 12 changed files with 326 additions and 49 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -23,7 +23,15 @@ jobs:
         # Nextflow versions: check pipeline minimum and current latest
         nxf_ver: ['19.10.0', '']
         aligner: ["--aligner 'hisat2'", "--aligner 'star'", "--pseudo_aligner 'salmon'"]
-        options: ['--skipQC', '--remove_rRNA', '--saveUnaligned', '--skipTrimming', '--star_index false']
+        options: 
+          - '--skipQC' 
+          - '--remove_rRNA'
+          - '--saveUnaligned'
+          - '--skipTrimming'
+          - '--star_index false'
+          - '--skip_rsem'
+          - '--with_umi'
+          - '--with_umi --skipTrimming --save_umi_intermediates --skip_rsem'
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@v2
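
To get a rough sense of how the longer `options` list grows the CI matrix, the combinations can be enumerated in Python. This is only a sketch: it assumes that `nxf_ver`, `aligner` and `options` are the only matrix axes, which the hunk above does not confirm, and `expand_matrix` is a hypothetical helper, not part of the workflow.

```python
from itertools import product

# Axis values copied from the ci.yml hunk above.
nxf_ver = ["19.10.0", ""]
aligner = ["--aligner 'hisat2'", "--aligner 'star'", "--pseudo_aligner 'salmon'"]
options = [
    "--skipQC",
    "--remove_rRNA",
    "--saveUnaligned",
    "--skipTrimming",
    "--star_index false",
    "--skip_rsem",
    "--with_umi",
    "--with_umi --skipTrimming --save_umi_intermediates --skip_rsem",
]


def expand_matrix(*axes):
    """Return every combination of the given axes (hypothetical helper)."""
    return list(product(*axes))


jobs = expand_matrix(nxf_ver, aligner, options)
print(len(jobs))  # number of combinations under the assumption above
```

Under that assumption, the three new entries raise the number of combinations from 2 × 3 × 5 = 30 to 2 × 3 × 8 = 48.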

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -20,6 +20,10 @@
 * Add option for `--additional_fasta` to provide ERCC spike-ins, transgenes such as GFP or CAR-T as additional sequences to align to [#419](https://github.com/nf-core/rnaseq/pull/419)
 * Updates awstest GitHub actions workflow with organization level secrets [#431](https://github.com/nf-core/rnaseq/pull/431/files)
 * Fix a bug where the RSEM reference could not be built [#436](https://github.com/nf-core/rnaseq/pull/436)
+* Implement UMI-based read deduplication [#435](https://github.com/nf-core/rnaseq/pull/435)
+* Fix a bug where gzipped references were not extracted when `--additional_fasta` was not specified [#435](https://github.com/nf-core/rnaseq/pull/435)
+* Fix a bug where merging of RSEM output would fail if only one FastQ file was provided as input [#435](https://github.com/nf-core/rnaseq/pull/435)
+* Correct RSEM output name (was saving counts but calling them TPMs; now saving both properly labelled) [#435](https://github.com/nf-core/rnaseq/pull/435)
 * Fix typo reported for work-dir [#434](https://github.com/nf-core/rnaseq/issues/434)
 
 #### Updated Packages
@@ -32,6 +36,7 @@
 
 * Added `pigz` `2.3.4` for parallelized trim-galore support
 * Added `rsem` `1.3.3` for gene/transcript quantification
+* Added `umi_tools` `1.0.1` for UMI extraction and deduplication
 
 ## [Version 1.4.2](https://github.com/nf-core/rnaseq/releases/tag/1.4.2) - 2019-10-18
 

diff --git a/README.md b/README.md
@@ -32,6 +32,7 @@ generates counts relative to genes
 [dupRadar](https://bioconductor.org/packages/release/bioc/html/dupRadar.html),
 [Preseq](http://smithlabresearch.org/software/preseq/),
 [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html),
+[umi_tools](https://github.com/CGATOxford/UMI-tools),
 [MultiQC](http://multiqc.info/)). See the [output documentation](docs/output.md) for more details of the results.
 
 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.

diff --git a/assets/where_are_my_files.txt b/assets/where_are_my_files.txt
@@ -21,6 +21,12 @@ Specify to save trimmed FastQ files to the results directory.
 Save any downloaded or generated reference genome files to your results folder.
 These can then be used for future pipeline runs, reducing processing times.
 
+`--save_umi_intermediates`
+UMI extraction and deduplication generate intermediate FastQ files with the UMIs
+removed from the read and added to the FastQ header, and deduplicated BAM files.
+Enabling this option saves these files to the `umitools` folder in the
+results directory.
+
 -----------------------------------
  Setting defaults in a config file
 -----------------------------------
@@ -31,6 +37,7 @@ the command line, you can save the following to your personal configuration file
 params.saveReference = true
 params.saveTrimmed = true
 params.saveAlignedIntermediates = true
+params.save_umi_intermediates = true
 
 For more help, see the following documentation:
 

diff --git a/bin/scrape_software_versions.py b/bin/scrape_software_versions.py
@@ -25,6 +25,7 @@
     'dupRadar': ['v_dupRadar.txt', r"(\S+)"],
     'edgeR': ['v_edgeR.txt', r"(\S+)"],
     'MultiQC': ['v_multiqc.txt', r"multiqc, version (\S+)"],
+    'umi_tools': ['v_umi_tools.txt', r"UMI-tools version: (\S+)"]
 }
 results = OrderedDict()
 results['nf-core/rnaseq'] = '<span style="color:#999999;">N/A</span>'
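
The new `umi_tools` entry pairs a version file with a regular expression whose first capture group is the version string. A minimal sketch of how such a pattern behaves is shown below. It assumes that `v_umi_tools.txt` contains a line of the form `UMI-tools version: 1.0.1` and that the pattern is applied with `re.search`; `parse_version` is a hypothetical helper and is not taken from the script.

```python
import re

# Pattern copied from the new dictionary entry above.
UMI_TOOLS_VERSION_RE = re.compile(r"UMI-tools version: (\S+)")


def parse_version(text, pattern=UMI_TOOLS_VERSION_RE):
    """Return the first captured version string, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Assumed file content; the real output of the tool may differ.
sample = "UMI-tools version: 1.0.1\n"
print(parse_version(sample))
```

Returning `None` on a failed match makes a missing or unexpectedly formatted version file easy to detect instead of raising inside the regular expression call.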

diff --git a/conf/test.config b/conf/test.config
@@ -29,4 +29,5 @@ params {
   gff = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gff'
   transcript_fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/transcriptome.fasta'
   additional_fasta = "https://github.com/nf-core/test-datasets/raw/rnaseq/reference/gfp.fa"
+  umitools_bc_pattern = "NNNN"
 }
diff --git a/conf/test_gz.config b/conf/test_gz.config
@@ -32,4 +32,5 @@ params {
   star_index = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/star.tar.gz'
   salmon_index = 'https://github.com/grst/test-datasets/raw/rnaseq/reference/salmon_index.tar.gz'
   compressedReference = true
+  umitools_bc_pattern = "NNNN"
 }
diff --git a/docs/output.md b/docs/output.md
@@ -13,6 +13,7 @@ and processes data using the following steps:
   - [TrimGalore](#trimgalore)
   - [SortMeRNA](#sortmerna)
   - [STAR](#star)
+  - [UMI tools](#umi-tools)
   - [RSeQC](#rseqc)
     - [BAM stat](#bam-stat)
     - [Infer experiment](#infer-experiment)
@@ -26,6 +27,7 @@ and processes data using the following steps:
   - [dupRadar](#dupradar)
   - [Preseq](#preseq)
   - [featureCounts](#featurecounts)
+  - [RSEM](#rsem)
   - [Salmon](#salmon)
   - [tximport](#tximport)
     - [Index files](#index-files)
@@ -106,6 +108,35 @@ The STAR section of the MultiQC report shows a bar plot with alignment rates: go
 - `unaligned/...`
   - Contains the unmapped reads that couldn't be mapped against the reference genome chosen. This is only available when the user specifically asks for `--saveUnaligned` output.
 
+## UMI tools
+
+UMI-tools deduplicates reads based on unique molecular identifiers (UMIs) to
+address PCR bias. UMI-tools performs two independent steps:
+
+ 1. From the raw FastQ files, remove the UMIs and add them to the read name (`extract`)
+ 2. Deduplicate the aligned BAM files based on the UMIs (`dedup`)
+
+**Output directory: `results/umitools/extract`**
+
+For each sample, this directory contains a log file.
+If `--save_umi_intermediates` is specified, this directory additionally contains the
+FastQ files with the extracted UMI.
+
+**Output directory: `results/umitools/dedup`**
+
+For each sample, this directory contains summary statistics on the UMI duplication:
+
+```console
+<SAMPLE>_edit_distance.tsv
+<SAMPLE>_per_umi_per_position.tsv
+<SAMPLE>_per_umi.tsv
+```
+
+The content of these files is explained [in the UMI-tools documentation](https://umi-tools.readthedocs.io/en/latest/reference/dedup.html#dedup-specific-options).
+
+If `--save_umi_intermediates` is specified, this directory additionally contains
+the deduplicated BAM file and its associated index.
+
 ## RSeQC
 
 RSeQC is a package of scripts designed to evaluate the quality of RNA seq data. You can find out more about the package at the [RSeQC website](http://rseqc.sourceforge.net/).
@@ -342,6 +373,10 @@ We also use featureCounts to count overlaps with different classes of features.
   - Main gene-level TPM matrix including all genes
 - `rsem_tpm_isoform.txt`
   - Main isoform-level TPM matrix
+- `rsem_transcript_counts_gene.txt`
+  - Main gene-level count matrix including all genes
+- `rsem_transcript_counts_isoform.txt`
+  - Main isoform-level count matrix
 - `*.stat`
   - RSEM statistics for each sample
 - `*.genes.results`

diff --git a/docs/usage.md b/docs/usage.md
@@ -43,6 +43,13 @@
     - [`--three_prime_clip_r2 [int]`](#threeprimeclipr2-int)
     - [`--trim_nextseq [int]`](#trimnextseq-int)
     - [`--skipTrimming`](#skiptrimming)
+  - [UMI handling](#umi-handling)
+    - [`--with_umi`](#withumi)
+    - [`--umitools_extract_method [str]`](#umitoolsextractmethod-str)
+    - [`--umitools_bc_pattern [str]`](#umitoolsbcpattern-str)
+    - [`--umitools_extract_extra [str]`](#umitoolsextractextra-str)
+    - [`--umitools_dedup_extra [str]`](#umitoolsdedupextra-str)
+    - [`--save_umi_intermediates`](#saveumiintermediates)
   - [Ribosomal RNA removal](#ribosomal-rna-removal)
     - [`--removeRiboRNA`](#removeriborna)
     - [`--saveNonRiboRNAReads`](#savenonribornareads)
@@ -453,6 +460,37 @@ This enables the option --nextseq-trim=3'CUTOFF within Cutadapt in Trim Galore,
 
 This allows the trimming process to be skipped, saving time when re-analyzing data that has already been trimmed.
 
+## UMI handling
+
+### `--with_umi`
+
+Enable UMI-based read deduplication.
+
+### `--umitools_extract_method [str]`
+
+Method used to interpret the UMI pattern. Can be either `string` (default) or `regex`.
+More details in the [UMI-tools documentation](https://umi-tools.readthedocs.io/en/latest/reference/extract.html#extract-method).
+
+### `--umitools_bc_pattern [str]`
+
+The UMI barcode pattern to use. See the [UMI-tools documentation](https://umi-tools.readthedocs.io/en/latest/reference/extract.html#extract-method)
+for more details.
+
+Example: `NNNNNN` means that the first 6 nucleotides of the read are the UMI.
+
+### `--umitools_extract_extra [str]`
+
+Extra arguments that are passed verbatim to the `umi_tools extract` step.
+
+### `--umitools_dedup_extra [str]`
+
+Extra arguments that are passed verbatim to the `umi_tools dedup` step.
+
+### `--save_umi_intermediates`
+
+If this option is specified, intermediate FastQ and BAM files produced
+by UMI-tools are saved to the output directory.
+
 ## Ribosomal RNA removal
 
 If rRNA removal is desired (for example, metatranscriptomics),
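
The `--umitools_bc_pattern` example in the UMI handling section above (`NNNNNN` marks the first 6 nucleotides of the read as the UMI) can be illustrated with a small Python sketch. It is not the UMI-tools implementation: `extract_umi` is a hypothetical function, it handles only single-end reads and patterns made up entirely of `N`, and it assumes the UMI is appended to the read identifier with an underscore.

```python
def extract_umi(name, seq, qual, bc_pattern="NNNNNN"):
    """Move the leading UMI bases of one read into its name (illustration only)."""
    if set(bc_pattern) != {"N"}:
        raise ValueError("this sketch only supports patterns consisting of 'N'")
    n = len(bc_pattern)
    if len(seq) < n or len(qual) != len(seq):
        raise ValueError("read is shorter than the pattern or has mismatched qualities")
    umi = seq[:n]
    # Keep any comment after the first space in the FastQ header untouched.
    read_id, sep, comment = name.partition(" ")
    new_name = f"{read_id}_{umi}{sep}{comment}"
    # Trim the UMI bases and their quality scores from the read.
    return new_name, seq[n:], qual[n:]


name, seq, qual = extract_umi("@read1", "ACGTAGTTTTCCCC", "IIIIIIIIIIIIII")
print(name, seq, qual)  # @read1_ACGTAG TTTTCCCC IIIIIIII
```

Keeping the UMI in the read name is what lets the later `dedup` step group aligned reads by UMI without needing the original FastQ files.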

diff --git a/environment.yml b/environment.yml
@@ -38,3 +38,4 @@ dependencies:
   - trim-galore=0.6.4
   - pigz=2.3.4
   - rsem=1.3.3
+  - umi_tools=1.0.1