Add --skip_markduplicates #205

Merged: 4 commits, May 18, 2020
11 changes: 7 additions & 4 deletions .github/workflows/ci.yml
@@ -54,6 +54,9 @@ jobs:
    env:
      NXF_ANSI_LOG: false
    runs-on: ubuntu-latest
    strategy:
      matrix:
        markduplicates: [--skip_markduplicates, '']
    steps:
      - uses: actions/checkout@v2
      - name: Install Nextflow
@@ -69,10 +72,10 @@ jobs:
        run: git clone --single-branch --branch sarek https://github.com/nf-core/test-datasets.git data
      - name: Run germline test
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input data/testdata/tiny/normal
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input=false --step prepare_recalibration -resume
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input=false --step recalibrate -resume
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input=false --step variantCalling
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input data/testdata/tiny/normal --save_bam_mapped
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input=false --step prepare_recalibration -resume
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input=false --step recalibrate -resume
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input=false --step variantCalling

  minimal:
    env:
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -31,6 +31,7 @@ Piellorieppe is one of the main massif in the Sarek National Park.
- [#195](https://github.com/nf-core/sarek/pull/195), [#202](https://github.com/nf-core/sarek/pull/202) - Add `--save_bam_mapped` params to save mapped BAMs.
- [#197](https://github.com/nf-core/sarek/pull/197) - Add step `prepare_recalibration` to allow restart from DuplicatesMarked BAMs
- [#204](https://github.com/nf-core/sarek/pull/204) - Add step `Control-FREEC` to allow restart from pileup files
- [#205](https://github.com/nf-core/sarek/pull/205) - Add `--skip_markduplicates` to allow skipping the `MarkDuplicates` process

### Changed

@@ -61,6 +62,8 @@ Piellorieppe is one of the main massif in the Sarek National Park.
- [#164](https://github.com/nf-core/sarek/pull/164) - Update `gatk4-spark` from `4.1.4.1` to `4.1.6.0`
- [#180](https://github.com/nf-core/sarek/pull/180), [#195](https://github.com/nf-core/sarek/pull/195) - Improve minimal setting
- [#183](https://github.com/nf-core/sarek/pull/183), [#204](https://github.com/nf-core/sarek/pull/204) - Update `input.md` documentation
- [#197](https://github.com/nf-core/sarek/pull/197) - Output directory `DuplicateMarked` is now replaced by `DuplicatesMarked`
- [#204](https://github.com/nf-core/sarek/pull/204) - Output directory `controlFREEC` is now replaced by `Control-FREEC`

### Fixed

@@ -130,7 +133,6 @@ Piellorieppe is one of the main massif in the Sarek National Park.
- [#152](https://github.com/nf-core/sarek/pull/152) - Removed `Jenkinsfile`
- [#169](https://github.com/nf-core/sarek/pull/169) - Removed omicX from README
- [#181](https://github.com/nf-core/sarek/pull/181) - Remove duplicate code in `nextflow.config`
- [#197](https://github.com/nf-core/sarek/pull/197) - Output directory `DuplicateMarked` is now replaced by `DuplicatesMarked`

## [2.5.2] - Jåkkåtjkaskajekna

90 changes: 62 additions & 28 deletions docs/input.md
@@ -2,40 +2,41 @@

## General information about the TSV files

Input files for Sarek can be specified using a TSV file given to the `--input` command.
There are different kinds of TSV files that can be used as input, depending on the input files available (fastq, uBAM, BAM...).
Input files for Sarek can be specified using a TSV (Tab Separated Values) file given to the `--input` command (Note, the delimiter is the tab (`\t`) character).
There are different kinds of TSV files that can be used as input, depending on the input files available (FASTQ, uBAM, BAM...).
For all possible TSV files, described in the next sections, here is an explanation of what the columns refer to:

- `subject` designates the subject, it should be the ID of the patient, and it must be unique for each patient, but one patient can have multiple samples (e.g. normal and tumor)
- `sex` are the sex chromosomes of the Patient, (XX or XY)
- `subject` designates the subject; it should be the ID of the subject and must be unique for each subject, but one subject can have multiple samples (e.g. normal and tumor)
- `sex` designates the sex chromosomes of the subject (XX or XY)
- `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
- `sample` designates the sample; it should be the ID of the sample and must be unique (it is possible to have more than one tumor sample for each subject, e.g. a tumor and a relapse), but samples can have multiple lanes (which will later be merged)
- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
- `fastq1` is the path to the first pair of the fastq file
- `fastq2` is the path to the second pair of the fastq file
- `bam` is the path to the bam file
- `bai` is the path to the bam index file
- `fastq1` is the path to the first FASTQ file of the read pair
- `fastq2` is the path to the second FASTQ file of the read pair
- `bam` is the path to the BAM file
- `bai` is the path to the BAM index file
- `recaltable` is the path to the recalibration table
- `mpileup` is the path to the mpileup file
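
As an illustrative sketch (not part of Sarek itself), the column rules above can be checked with a few lines of Python before launching the pipeline. The helper name `parse_tsv`, the choice of the uBAM column order (`subject sex status sample lane bam`) as the default, and the example path are assumptions made for this example only.

```python
import csv
import io

# Column order of the uBAM layout (`subject sex status sample lane bam`).
# Other layouts can be passed in through the `columns` argument.
UBAM_COLUMNS = ("subject", "sex", "status", "sample", "lane", "bam")


def parse_tsv(text, columns=UBAM_COLUMNS):
    """Split tab-separated rows into dicts and run basic sanity checks."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for line_number, fields in enumerate(reader, start=1):
        if not fields:
            continue  # csv.reader yields an empty list for blank lines
        if len(fields) != len(columns):
            raise ValueError(
                f"line {line_number}: expected {len(columns)} columns, got {len(fields)}"
            )
        row = dict(zip(columns, fields))
        if row["sex"] not in {"XX", "XY"}:
            raise ValueError(f"line {line_number}: sex must be XX or XY")
        if row["status"] not in {"0", "1"}:
            raise ValueError(f"line {line_number}: status must be 0 (normal) or 1 (tumor)")
        if "lane" in row and not row["lane"]:
            raise ValueError(f"line {line_number}: lane must contain at least one character")
        rows.append(row)
    return rows


# Hypothetical row; the path is a placeholder, not a real file.
example = "G15511\tXX\t0\tC09DFN\tC09DF_1\tpathToFiles/example.bam\n"
for row in parse_tsv(example):
    print(row["subject"], row["sample"], row["lane"])
```

The check is deliberately shallow: it verifies the shape of each row but does not test whether the listed files exist.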

It is recommended to use absolute paths to the files, but relative paths should also work.
Note, the delimiter is the tab (`\t`) character.

All examples are given for a normal/tumor pair.
If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair, producing the germline variant calling results only.
If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair, producing the germline Variant Calling results only.

Sarek will output results in a different directory for each sample.
If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples.
Multiple TSV files can be specified if the path is enclosed in quotes.

Somatic variant calling output will be in a specific directory for each normal/tumor pair.
Output from Variant Calling and/or Annotation will be in a specific directory for each sample (or normal/tumor pair if applicable).

## Starting from the mapping step

When starting from the mapping step (`--step mapping`), the first step of Sarek, the input can have three different forms:

- A TSV file containing the sample metadata and the path to the fastq files.
- The Path to a directory containing the fastq files
- A TSV file containing the sample metadata and the path to the paired-end FASTQ files.
- The path to a directory containing the FASTQ files
- A TSV file containing the sample metadata and the path to the unmapped BAM (uBAM) files.

### Providing a TSV file with the path to FASTQ files
@@ -59,13 +60,13 @@ G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.
Input files for Sarek can be specified using the path to a FASTQ directory given to the `--input` command only with the `mapping` step.

```bash
nextflow run nf-core/sarek --input pathToDirectory ...
nextflow run nf-core/sarek --input /path/To/Directory ...
```

#### Input FASTQ file name best practices

The input folder, containing the FASTQ files for one subject (ID) should be organized into one sub-folder for every sample.
All fastq files for that sample should be collected here.
All FASTQ files for that sample should be collected here.

```text
ID
@@ -84,7 +85,7 @@ ID
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
```

Fastq filename structure:
FASTQ filename structure:

- `sample_lib_flowcell-index_lane_R1_1000.fastq.gz` and
- `sample_lib_flowcell-index_lane_R2_1000.fastq.gz`
@@ -96,7 +97,7 @@ Where:
- `flowcell` = identifier of flow cell for the sequencing run
- `lane` = identifier of the lane of the sequencing run
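
The naming scheme above can be split into its fields with a regular expression. The following is a minimal sketch, not Sarek's actual parsing code: the function name `parse_fastq_name`, the group names, and the assumption that no individual field contains an underscore are all introduced here for illustration. The trailing `1000` is captured as `suffix` without assigning it any meaning.

```python
import re

# One named group per underscore-separated field of
# `sample_lib_flowcell-index_lane_R1_1000.fastq.gz`.
FASTQ_NAME = re.compile(
    r"^(?P<sample>[^_]+)"
    r"_(?P<lib>[^_]+)"
    r"_(?P<flowcell_index>[^_]+)"
    r"_(?P<lane>[^_]+)"
    r"_(?P<read>R[12])"
    r"_(?P<suffix>[^_]+)"
    r"\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the fields of a FASTQ file name as a dict, or raise ValueError."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected FASTQ file name: {filename!r}")
    return match.groupdict()


fields = parse_fastq_name("sample1_lib_flowcell-index_lane_R1_1000.fastq.gz")
print(fields["sample"], fields["flowcell_index"], fields["read"])
```

Rejecting names that do not match, rather than guessing, makes a misnamed file visible before any read group is derived from it.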

Read group information will be parsed from fastq file names according to this:
Read group information will be parsed from FASTQ file names according to this:

- `RGID` = "sample_lib_flowcell_index_lane"
- `RGPL` = "Illumina"
@@ -105,7 +106,7 @@ Read group information will be parsed from fastq file names according to this:

### Providing a TSV file with the paths to uBAM files

The TSV (Tab Separated Values) file for starting the mapping from uBAM files should contain the columns:
The TSV file for starting the mapping from uBAM files should contain the columns:

- `subject sex status sample lane bam`

@@ -133,10 +134,25 @@ G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.bam pathToFiles/G
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.bam pathToFiles/G15511.D0ENMT.md.bai
```

When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked_no_table.tsv` and can be used to restart Sarek from the non-recalibrated BAM files. Setting the step `--step prepare_recalibration` will automatically take this file as input.
When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process.
This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked_no_table.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using `--step prepare_recalibration` will automatically take this file as input.

Additionally, individual TSV files for each sample (`duplicates_marked_no_table_[SAMPLE].tsv`) can be found in the same directory.

If `--skip_markduplicates` has been specified, the TSV file for this step will be slightly different:

```text
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.bam pathToFiles/G15511.C09DFN.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.bam pathToFiles/G15511.D0ENMT.bai
```

When starting Sarek from the mapping step with `--skip_markduplicates`, a TSV file is generated automatically after the `Mapping` processes.
This TSV file is stored under `results/Preprocessing/TSV/mapped.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using `--step prepare_recalibration --skip_markduplicates` will automatically take this file as input.

Additionally, individual TSV files for each sample (`mapped_[SAMPLE].tsv`) can be found in the same directory.

## Starting from the BAM recalibration step

To start from the recalibration step (`--step recalibrate`), a TSV file for a normal/tumor pair needs to be given as input, containing the paths to the non-recalibrated but already mapped BAM files.
@@ -147,14 +163,29 @@ The TSV needs to contain the following columns:
In the same way, if you have non-recalibrated BAMs, their indexes and their recalibration tables, use a structure like:

```text
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.bam pathToFiles/G15511.C09DFN.md.bai pathToFiles/G15511.C09DFN.md.recal.table
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.bam pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.md.recal.table
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.bam pathToFiles/G15511.C09DFN.md.bai pathToFiles/G15511.C09DFN.recal.table
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.bam pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.recal.table
```

When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files. Setting the step `--step recalibrate` will automatically take this file as input.
When starting Sarek from the mapping step, a TSV file is generated automatically after the `BaseRecalibrator` processes.
This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using `--step recalibrate` will automatically take this file as input.

Additionally, individual TSV files for each sample (`duplicates_marked_[SAMPLE].tsv`) can be found in the same directory.

If `--skip_markduplicates` has been specified, the TSV file for this step will be slightly different:

```text
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.bam pathToFiles/G15511.C09DFN.bai pathToFiles/G15511.C09DFN.recal.table
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.bam pathToFiles/G15511.D0ENMT.bai pathToFiles/G15511.D0ENMT.recal.table
```

When starting Sarek from the mapping step with `--skip_markduplicates`, a TSV file is generated automatically after the `BaseRecalibrator` processes.
This TSV file is stored under `results/Preprocessing/TSV/mapped_no_duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
Using `--step recalibrate --skip_markduplicates` will automatically take this file as input.

Additionally, individual TSV files for each sample (`mapped_no_duplicates_marked_[SAMPLE].tsv`) can be found in the same directory.

## Starting from the variant calling step

A TSV file for a normal/tumor pair with recalibrated BAM files and their indexes can be provided to start Sarek from the variant calling step (`--step variantcalling`).
@@ -165,11 +196,13 @@ The TSV file should contain the columns:
Here is an example for two samples from the same subject:

```text
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.recal.bam pathToFiles/G15511.C09DFN.md.recal.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.recal.bam pathToFiles/G15511.D0ENMT.md.recal.bai
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.recal.bam pathToFiles/G15511.C09DFN.recal.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.recal.bam pathToFiles/G15511.D0ENMT.recal.bai
```

When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes. This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files. Setting the step `--step variantcalling` will automatically take this file as input.
When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes.
This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files.
Using `--step variantcalling` will automatically take this file as input.

Additionally, individual TSV files for each sample (`recalibrated_[SAMPLE].tsv`) can be found in the same directory.

@@ -180,15 +213,16 @@ The TSV needs to contain the following columns:

- `subject sex status sample mpileup`

The same way, if you have non recalibrated BAMs and their indexes, you should use a structure like:
Here is an example for one normal/tumor pair from one subject:

```text
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.pileup
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.pileup
```

When starting Sarek from the Control-FREEC step, a TSV file is generated automatically after the `mpileup` process.
This TSV file is stored under `results/VariantCalling/TSV/control-freec_mpileup.tsv` and can be used to restart Sarek from the mpileup files. Setting the step `--step Control-FREEC` will automatically take this file as input.
This TSV file is stored under `results/VariantCalling/TSV/control-freec_mpileup.tsv` and can be used to restart Sarek from the mpileup files.
Using `--step Control-FREEC` will automatically take this file as input.

Additionally, individual TSV files for each sample (`control-freec_mpileup_[SAMPLE].tsv`) can be found in the same directory.
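
The automatically generated TSV files described in the sections above can be summarised as a small lookup. This is only a sketch for orientation: the function name `restart_tsv` is hypothetical, `mapped.tsv` is assumed to pair with `prepare_recalibration` because it is documented in that section, and the `variantcalling` and `Control-FREEC` files are assumed not to depend on `--skip_markduplicates` since only one file is documented for each.

```python
PREPROCESSING_TSV = "results/Preprocessing/TSV"
VARIANT_CALLING_TSV = "results/VariantCalling/TSV"

# (step, skip_markduplicates) -> file name under results/Preprocessing/TSV
FLAG_DEPENDENT = {
    ("prepare_recalibration", False): "duplicates_marked_no_table.tsv",
    ("prepare_recalibration", True): "mapped.tsv",
    ("recalibrate", False): "duplicates_marked.tsv",
    ("recalibrate", True): "mapped_no_duplicates_marked.tsv",
}

# Steps documented with a single generated TSV (assumed flag-independent).
FLAG_INDEPENDENT = {
    "variantcalling": f"{PREPROCESSING_TSV}/recalibrated.tsv",
    "Control-FREEC": f"{VARIANT_CALLING_TSV}/control-freec_mpileup.tsv",
}


def restart_tsv(step, skip_markduplicates=False):
    """Return the generated TSV that a restart from `step` would pick up."""
    if step in FLAG_INDEPENDENT:
        return FLAG_INDEPENDENT[step]
    try:
        name = FLAG_DEPENDENT[(step, skip_markduplicates)]
    except KeyError:
        raise ValueError(f"no generated TSV documented for step {step!r}") from None
    return f"{PREPROCESSING_TSV}/{name}"


print(restart_tsv("prepare_recalibration", skip_markduplicates=True))
```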

5 changes: 5 additions & 0 deletions docs/usage.md
@@ -28,6 +28,7 @@
- [--markdup_java_options](#--markdup_java_options)
- [--no_gatk_spark](#--no_gatk_spark)
- [--save_bam_mapped](#--save_bam_mapped)
- [--skip_markduplicates](#--skip_markduplicates)
- [Variant Calling](#variant-calling)
- [--ascat_ploidy](#--ascat_ploidy)
- [--ascat_purity](#--ascat_purity)
@@ -364,6 +365,10 @@ Use this to disable usage of GATK Spark implementation of their tools in local m

Will save mapped BAMs.

### --skip_markduplicates

Will skip `MarkDuplicates`. This parameter also saves the mapped BAMs, to enable restarting from the `prepare_recalibration` step.

## Variant Calling

### --ascat_ploidy