v2.0 updates #121

Merged
merged 7 commits on May 9, 2022
6 changes: 3 additions & 3 deletions .github/workflows/ci.yml
@@ -39,7 +39,7 @@ jobs:
sudo mv nextflow /usr/local/bin/

# Work around for the unexpected end of file error that github actions seems to get when downloading compressed
# files during pipelie execution
# files during pipeline execution
- name: Download data
run: |
wget -q https://raw.githubusercontent.com/nf-core/test-datasets/cutandrun/reference/genomes/hg38-chr20.fa.gz
@@ -81,7 +81,7 @@ jobs:
sudo mv nextflow /usr/local/bin/

# Work around for the unexpected end of file error that github actions seems to get when downloading compressed
# files during pipelie execution
# files during pipeline execution
- name: Download data
run: |
wget -q https://raw.githubusercontent.com/nf-core/test-datasets/cutandrun/reference/genomes/hg38-chr20.fa.gz
@@ -121,7 +121,7 @@ jobs:
nextflow self-update

# Work around for the unexpected end of file error that github actions seems to get when downloading compressed
# files during pipelie execution
# files during pipeline execution
- name: Download data
run: |
wget -q https://raw.githubusercontent.com/nf-core/test-datasets/cutandrun/reference/genomes/hg38-chr20.fa.gz
43 changes: 43 additions & 0 deletions CHANGELOG.md
@@ -60,3 +60,46 @@ Note, since the pipeline is now using Nextflow DSL2, each process will be run wi
> **NB:** Dependency has been **updated** if both old and new version information is present.
> **NB:** Dependency has been **added** if just the new version information is present.
> **NB:** Dependency has been **removed** if version information isn't present.

## [2.0] - 2022-05-15

### Major Changes

- [[#53](https://github.com/nf-core/cutandrun/issues/53)] - Complete redesign of the samplesheet input system. Controls are no longer hard-coded to `igg` and the process of assigning controls to samples has been simplified.
- [[#93](https://github.com/nf-core/cutandrun/issues/93)], [[#73](https://github.com/nf-core/cutandrun/issues/73)] - Additional sample normalisation options have been added. In addition to normalising using detected spike-in DNA, there are now several options for normalising against read depth instead as well as skipping normalisation entirely.
- [[#62](https://github.com/nf-core/cutandrun/issues/62)] - Added MACS2 as an optional peak caller. Peak calling can now be configured using the `peakcaller` parameter. Both peak callers can be run using `--peakcaller SEACR,MACS2`; the primary caller is the first item in the list and will be used downstream, while the secondary will be run and its output written to the results folder.
- [[#101](https://github.com/nf-core/cutandrun/issues/101)] - `v1.1` ran consensus peak calling at both the group level and for all samples. This was causing performance issues for larger sample sets. There is now a new `consensus_peak_mode` parameter that defaults to `group`. Consensus will only be run on all samples if this is changed to `all`.

### Enhancements

- Updated pipeline template to nf-core/tools `2.3.2`.
- Upgraded pipeline to support the new nf-core module configuration system.
- More robust CI testing: over 213 tests now run before any code is merged into the main code base.
- More control over which parts of the pipeline run. Explicit skipping has been implemented for every section of the pipeline.
- Added options for scaling control data before it is used to call peaks. This is especially useful when using read depth normalisation as this can sometimes result in few peaks being called due to high background levels.
- Added support for Bowtie2 large indexes.
- IGV auto-session builder now supports `gff` and `fna` file extensions.
- Bowtie2 alignment has been altered to run in `--end-to-end` mode only if trimming is skipped. If trimming is activated then it will run in `--local` mode.
- [[#88](https://github.com/nf-core/cutandrun/issues/88)] - Many processes have been optimized for resource utilization. Users will especially notice that single thread processes will now only request 1 core rather than 2.
- [[#63](https://github.com/nf-core/cutandrun/issues/63)] - Custom containers for python reporting have now been condensed into a single container and added to BioConda.
- [[#76](https://github.com/nf-core/cutandrun/issues/76)] - Standardized python versions across reporting modules.

### Fixes

- [[#120](https://github.com/nf-core/cutandrun/issues/120)] - DeepTools compute matrix/heatmaps now only runs if there are peaks detected.
- [[#99](https://github.com/nf-core/cutandrun/issues/99)] - Large upset plots were causing process crashes. Upset plots will now fail gracefully if the number of samples in the consensus group is more than 10.
- [[#95](https://github.com/nf-core/cutandrun/issues/95)] - Fixed FRIP calculation performance issues and crashes.

### Software dependencies

Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.

| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| `samtools` | 1.14 | 1.15.1 |
| `bowtie2` | 2.4.2 | 2.4.4 |
| `picard` | 2.25.7 | 2.27.1 |

> **NB:** Dependency has been **updated** if both old and new version information is present.
> **NB:** Dependency has been **added** if just the new version information is present.
> **NB:** Dependency has been **removed** if version information isn't present.
2 changes: 1 addition & 1 deletion docs/output.md
@@ -215,7 +215,7 @@ The merge function from [BEDtools](https://github.com/arq5x/bedtools2) is used t
![Peak calling - group consensus peak plot](images/consensus_peaks.png)
![Peak calling - group consensus peak plot](images/all_consensus_peaks.png)

##  Reporting
## Reporting

### Python reporting

42 changes: 36 additions & 6 deletions docs/usage.md
@@ -63,17 +63,47 @@ nextflow pull nf-core/cutandrun

## Pipeline Configuration Options

Trimming
### Flow and Output Configuration

Saving, skipping flow options
There are some options detailed on the parameters page that are prefixed with `save`, `skip` or `only`. These are flow-control options that allow for saving additional output to the results directory, skipping unwanted portions of the pipeline, or running the pipeline only up to a certain point, which can be useful for testing.

Deduplication
### De-duplication

Normalisation
CUT&RUN and CUT&Tag both integrate adapters in the vicinity of antibody-tethered enzymes, and the exact sites of integration are affected by the accessibility of the surrounding DNA. Given these experimental parameters, many fragments are expected to share common start and end positions; such duplicates are generally valid but would be filtered out by de-duplication tools. However, a fraction of fragments will be present due to PCR duplication and cannot be separated from the genuine ones.

peak calling
Control samples such as those from IgG datasets have relatively high duplication rates due to non-specific interactions with the genome; therefore, it is appropriate to remove duplicates from control samples.

consensus peaks
The pipeline therefore defaults to running de-duplication on control samples only. If a heavy fraction of PCR duplicates is suspected in the primary samples, de-duplication can be enabled for target reads as well by setting

`--dedup_target_reads true`
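To illustrate why coordinate-based de-duplication is risky for target samples, the sketch below keeps one fragment per identical (chromosome, start, end) position, which is roughly what standard duplicate-marking tools do. This is a simplified illustration, not the pipeline's actual implementation:

```python
def mark_duplicates(fragments):
    """Keep one fragment per identical (chrom, start, end) position.

    In CUT&RUN/CUT&Tag, fragments sharing coordinates are often genuine
    signal at accessible tethering sites, so this removes real reads too.
    """
    seen = set()
    kept = []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            kept.append(frag)
    return kept

fragments = [
    ("chr20", 1000, 1150),  # genuine signal at an accessible site
    ("chr20", 1000, 1150),  # PCR duplicate? indistinguishable by position
    ("chr20", 2000, 2180),
]
print(mark_duplicates(fragments))  # two unique fragments survive
```

Because a genuine fragment and a PCR duplicate at the same position are indistinguishable here, enabling this for target reads trades real signal for duplicate removal.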

### Peak Normalisation

The pipeline's default mode is to normalise stacked reads for epitope abundance using spike-in normalisation before peak calling.

Traditionally, E. coli DNA is carried along with the bacterially-produced enzymes used in CUT&RUN and CUT&Tag experiments and gets tagmented non-specifically during the reaction. The fraction of total reads that map to the E. coli genome depends on the yield of epitope-targeted CUT&Tag, and so depends on the number of cells used and the abundance of that epitope in chromatin. Since a constant amount of protein is added to the reactions and brings along a fixed amount of E. coli DNA, E. coli reads can be used to normalise epitope abundance in a set of experiments.
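The general shape of spike-in normalisation can be sketched as a scale factor inversely proportional to the E. coli read count. The constant `10000` and the function name below are illustrative assumptions, not the pipeline's exact formula:

```python
def spikein_scale_factor(ecoli_mapped_reads, constant=10000):
    """Scale factor inversely proportional to the spike-in read count.

    Fewer spike-in reads imply a higher epitope-targeted yield, so the
    sample's coverage is scaled down less; more spike-in reads scale it
    down more.
    """
    if ecoli_mapped_reads == 0:
        raise ValueError("no spike-in reads; spike-in normalisation not possible")
    return constant / ecoli_mapped_reads

# Two samples with equal target depth but different spike-in depth:
print(spikein_scale_factor(2000))  # 5.0
print(spikein_scale_factor(500))   # 20.0
```

Multiplying each sample's coverage track by its factor puts the samples on a comparable epitope-abundance scale.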

Since the introduction of these techniques, several factors have reduced the usefulness of this type of normalisation under certain experimental conditions. Firstly, many commercially available kits now contain very low levels of E. coli DNA, requiring users to spike in their own DNA for normalisation, which is not always done. Secondly, the normalisation approach depends on the cell count being constant between samples, which in our experience is difficult to achieve, especially in tissue samples.

For these reasons we provide several other modes of normalisation based on read count; however, it should be noted that this form of normalisation is more simplistic and does not take epitope abundance into account. These normalisation modes are performed by [Deeptools bamCoverage](https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html). Some are more relevant than others to this type of data; we recommend CPM with a bin size of 1 as a default.

| Mode      | Description                                                                                                 |
| --------- | ----------------------------------------------------------------------------------------------------------- |
| `Spikein` | The default mode, which normalises by E. coli DNA.                                                           |
| `RPKM`    | Reads Per Kilobase per Million mapped reads. More relevant for transcript-based assays.                      |
| `CPM`     | Counts Per Million mapped reads: number of reads per bin / number of mapped reads (in millions). Default bin size is 1. |
| `BPM`     | Bins Per Million: number of reads per bin / sum of all reads per bin (in millions).                          |
| `None`    | Disables normalisation.                                                                                      |

The normalisation mode can be changed with the `normalisation_mode` parameter.
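The CPM arithmetic in the table above can be sketched as follows; this is simplified per-bin arithmetic, as the pipeline delegates the real calculation to Deeptools bamCoverage:

```python
def cpm(bin_counts, total_mapped_reads):
    """Counts Per Million: reads per bin divided by mapped reads in millions."""
    millions = total_mapped_reads / 1_000_000
    return [count / millions for count in bin_counts]

# 4 bins from a library with 2 million mapped reads in total
print(cpm([10, 0, 4, 6], 2_000_000))  # [5.0, 0.0, 2.0, 3.0]
```

Dividing by library size in millions makes bin values comparable between samples sequenced to different depths.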

### Peak Calling

This pipeline currently provides peak calling via `SEACR` or `MACS2`, selected with the `peakcaller` parameter. If control samples are provided in the sample sheet, by default they will be used to normalise the called peaks against non-specific background noise. Control normalisation can be disabled using `use_control`. Additionally, it may be necessary to scale the control samples being used as background, especially when read-count normalisation methods have been used at earlier stages in the pipeline. To scale the control samples before peak calling, change the `igg_scale_factor` parameter to a number between 0 and 1. Multiple peak callers can be run using comma-separated values, e.g. `--peakcaller SEACR,MACS2`; in this mode the primary peak caller is the first in the list and will be used for downstream processing, while any additional peak callers simply output to the results directory.
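The effect of `igg_scale_factor` can be illustrated as scaling down the control track before it is used as background. This is a sketch only, not the pipeline's actual code:

```python
def scale_control(control_signal, igg_scale_factor):
    """Down-weight a control (e.g. IgG) signal track before peak calling.

    igg_scale_factor must be between 0 and 1; smaller values reduce the
    background that the peak caller weighs against the target signal.
    """
    if not 0 <= igg_scale_factor <= 1:
        raise ValueError("igg_scale_factor must be between 0 and 1")
    return [x * igg_scale_factor for x in control_signal]

print(scale_control([8.0, 4.0, 2.0], 0.5))  # [4.0, 2.0, 1.0]
```

When read-depth normalisation has inflated the control relative to the target, a factor below 1 restores a workable signal-to-background ratio so that peaks are not suppressed wholesale.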

### Consensus Peaks

After peak calling, consensus peaks are calculated by merging peaks within the same group. The number of replicates required for a valid peak can be changed using `replicate_threshold`. In some situations a user may wish to call consensus peaks based on all samples; this can be configured by changing the `consensus_peak_mode` parameter from `group` to `all`.
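The group consensus step can be sketched as merging overlapping replicate peaks and keeping merged intervals supported by at least `replicate_threshold` replicates. This is a simplified stand-in for the BEDtools merge-based logic; the function name and single-chromosome assumption are illustrative:

```python
def consensus_peaks(peaks_by_replicate, replicate_threshold=2):
    """Merge overlapping peaks across replicates on one chromosome.

    peaks_by_replicate: list (one entry per replicate) of (start, end) lists.
    Returns merged intervals supported by >= replicate_threshold replicates.
    """
    tagged = sorted(
        (start, end, rep)
        for rep, peaks in enumerate(peaks_by_replicate)
        for start, end in peaks
    )
    merged = []  # list of (start, end, set_of_supporting_replicates)
    for start, end, rep in tagged:
        if merged and start <= merged[-1][1]:
            last = merged[-1]
            merged[-1] = (last[0], max(last[1], end), last[2] | {rep})
        else:
            merged.append((start, end, {rep}))
    return [(s, e) for s, e, reps in merged if len(reps) >= replicate_threshold]

reps = [[(100, 200), (500, 600)], [(150, 250)], [(900, 1000)]]
print(consensus_peaks(reps, replicate_threshold=2))  # [(100, 250)]
```

Only the interval supported by two replicates survives; the singleton peaks at 500-600 and 900-1000 are dropped at the default threshold.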

### Reproducibility

4 changes: 3 additions & 1 deletion modules/local/modules/calculate_frip/main.nf
@@ -3,7 +3,9 @@ process CALCULATE_FRIP {
label 'process_low'

conda (params.enable_conda ? "conda-forge::python=3.8.3 bioconda::deeptools=3.5.* bioconda::pysam=0.17.*" : null)
container "chrischeshire/nfcore-cutandrun:latest"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' :
'quay.io/biocontainers/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' }"

input:
tuple val(meta), path(bam), path(bai), path(bed)
4 changes: 3 additions & 1 deletion modules/local/modules/calculate_peak_reprod/main.nf
@@ -3,7 +3,9 @@ process CALCULATE_PEAK_REPROD {
label 'process_ultralow'

conda (params.enable_conda ? "conda-forge::python=3.8.3 conda-forge::dask=2021.9.1 conda-forge::pandas=1.3.3" : null)
container "chrischeshire/nfcore-cutandrun:latest"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' :
'quay.io/biocontainers/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' }"

input:
tuple val(meta), path(bed)
4 changes: 3 additions & 1 deletion modules/local/modules/generate_reports/main.nf
@@ -2,7 +2,9 @@ process GENERATE_REPORTS {
label 'process_ultralow'

conda (params.enable_conda ? "conda-forge::python=3.8.3 conda-forge::pandas=1.3.3" : null)
container "chrischeshire/nfcore-cutandrun:latest"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' :
'quay.io/biocontainers/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' }"

input:
path meta_data
4 changes: 3 additions & 1 deletion modules/local/modules/plot_consensus_peaks/main.nf
@@ -2,7 +2,9 @@ process PLOT_CONSENSUS_PEAKS {
label 'process_min'

conda (params.enable_conda ? "conda-forge::python=3.8.3 conda-forge::numpy=1.20.* conda-forge::pandas=1.2.* conda-forge::upsetplot=0.4.4" : null)
container "chrischeshire/nfcore-cutandrun:latest"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' :
'quay.io/biocontainers/mulled-v2-f42a44964bca5225c7860882e231a7b5488b5485:47ef981087c59f79fdbcab4d9d7316e9ac2e688d-0' }"

input:
path(consensus_peaks)