nf-core · maxulysse · May 18, 2020
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -54,6 +54,9 @@ jobs:
     env:
       NXF_ANSI_LOG: false
     runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        markduplicates: [--skip_markduplicates, '']
     steps:
       - uses: actions/checkout@v2
       - name: Install Nextflow
@@ -69,10 +72,10 @@ jobs:
         run: git clone --single-branch --branch sarek https://github.com/nf-core/test-datasets.git data
       - name: Run germline test
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input data/testdata/tiny/normal
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input=false --step prepare_recalibration -resume
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input=false --step recalibrate -resume
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input=false --step variantCalling
+          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input data/testdata/tiny/normal --save_bam_mapped
+          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input=false --step prepare_recalibration -resume
+          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input=false --step recalibrate -resume
+          nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.markduplicates }} --input=false --step variantCalling
 
   minimal:
     env:

@@ -31,6 +31,7 @@ Piellorieppe is one of the main massif in the Sarek National Park.
 - [#195](https://github.com/nf-core/sarek/pull/195), [#202](https://github.com/nf-core/sarek/pull/202) - Add `--save_bam_mapped` params to save mapped BAMs.
 - [#197](https://github.com/nf-core/sarek/pull/197) - Add step `prepare_recalibration` to allow restart from DuplicatesMarked BAMs
 - [#204](https://github.com/nf-core/sarek/pull/204) - Add step `Control-FREEC` to allow restart from pileup files
+- [#205](https://github.com/nf-core/sarek/pull/205) - Add `--skip_markduplicates` to allow skipping the `MarkDuplicates` process
 
 ### Changed
 
@@ -61,6 +62,8 @@ Piellorieppe is one of the main massif in the Sarek National Park.
 - [#164](https://github.com/nf-core/sarek/pull/164) - Update `gatk4-spark` from `4.1.4.1` to `4.1.6.0`
 - [#180](https://github.com/nf-core/sarek/pull/180), [#195](https://github.com/nf-core/sarek/pull/195) - Improve minimal setting
 - [#183](https://github.com/nf-core/sarek/pull/183), [#204](https://github.com/nf-core/sarek/pull/204) - Update `input.md` documentation
+- [#197](https://github.com/nf-core/sarek/pull/197) - Output directory `DuplicateMarked` is now replaced by `DuplicatesMarked`
+- [#204](https://github.com/nf-core/sarek/pull/204) - Output directory `controlFREEC` is now replaced by `Control-FREEC`
 
 ### Fixed
 
@@ -130,7 +133,6 @@ Piellorieppe is one of the main massif in the Sarek National Park.
 - [#152](https://github.com/nf-core/sarek/pull/152) - Removed `Jenkinsfile`
 - [#169](https://github.com/nf-core/sarek/pull/169) - Removed omicX from README
 - [#181](https://github.com/nf-core/sarek/pull/181) - Remove duplicate code in `nextflow.config`
-- [#197](https://github.com/nf-core/sarek/pull/197) - Output directory `DuplicateMarked` is now replaced by`DuplicatesMarked`
 
 ## [2.5.2] - Jåkkåtjkaskajekna
 

@@ -2,40 +2,41 @@
 
 ## General information about the TSV files
 
-Input files for Sarek can be specified using a TSV file given to the `--input` command.
-There are different kinds of TSV files that can be used as input, depending on the input files available (fastq, uBAM, BAM...).
+Input files for Sarek can be specified using a TSV (Tab Separated Values) file given to the `--input` command (note that the delimiter is the tab (`\t`) character).
+There are different kinds of TSV files that can be used as input, depending on the input files available (FASTQ, uBAM, BAM...).
 For all possible TSV files, described in the next sections, here is an explanation of what the columns refer to:
 
-- `subject` designates the subject, it should be the ID of the patient, and it must be unique for each patient, but one patient can have multiple samples (e.g. normal and tumor)
-- `sex` are the sex chromosomes of the Patient, (XX or XY)
+- `subject` designates the subject, it should be the ID of the subject, and it must be unique for each subject, but one subject can have multiple samples (e.g.
+normal and tumor)
+- `sex` designates the sex chromosomes of the subject (XX or XY)
 - `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
-- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
+- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each subject, e.g.
+a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
 - `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
-- `fastq1` is the path to the first pair of the fastq file
-- `fastq2` is the path to the second pair of the fastq file
-- `bam` is the path to the bam file
-- `bai` is the path to the bam index file
+- `fastq1` is the path to the first pair of the FASTQ file
+- `fastq2` is the path to the second pair of the FASTQ file
+- `bam` is the path to the BAM file
+- `bai` is the path to the BAM index file
 - `recaltable` is the path to the recalibration table
 - `mpileup` is the path to the mpileup file
 
 It is recommended to add the absolute path of the files, but relative path should also work.
-Note, the delimiter is the tab (`\t`) character.
 
 All examples are given for a normal/tumor pair.
-If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair, producing the germline variant calling results only.
+If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair, producing the germline Variant Calling results only.
 
 Sarek will output results in a different directory for each sample.
 If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples.
 Multiple TSV files can be specified if the path is enclosed in quotes.
 
-Somatic variant calling output will be in a specific directory for each normal/tumor pair.
+Output from Variant Calling and/or Annotation will be in a specific directory for each sample (or normal/tumor pair if applicable).
 
 ## Starting from the mapping step
 
 When starting from the mapping step (`--step mapping`), the first step of Sarek, the input can have three different forms:
 
-- A TSV file containing the sample metadata and the path to the fastq files.
-- The Path to a directory containing the fastq files
+- A TSV file containing the sample metadata and the path to the paired-end FASTQ files.
+- The path to a directory containing the FASTQ files
 - A TSV file containing the sample metadata and the path to the unmapped BAM (uBAM) files.
 
 ### Providing a TSV file with the path to FASTQ files
@@ -59,13 +60,13 @@ G15511    XX    1    D0ENMT    D0ENM_2    pathToFiles/D0ENMACXX111207.2_1.fastq.
 Input files for Sarek can be specified using the path to a FASTQ directory given to the `--input` command only with the `mapping` step.
 
 ```bash
-nextflow run nf-core/sarek --input pathToDirectory ...
+nextflow run nf-core/sarek --input /path/To/Directory ...
 ```
 
 #### Input FASTQ file name best practices
 
 The input folder, containing the FASTQ files for one subject (ID) should be organized into one sub-folder for every sample.
-All fastq files for that sample should be collected here.
+All FASTQ files for that sample should be collected here.
 
 ```text
 ID
@@ -84,7 +85,7 @@ ID
 +------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
 ```
 
-Fastq filename structure:
+FASTQ filename structure:
 
 - `sample_lib_flowcell-index_lane_R1_1000.fastq.gz` and
 - `sample_lib_flowcell-index_lane_R2_1000.fastq.gz`
@@ -96,7 +97,7 @@ Where:
 - `flowcell` = identifier of flow cell for the sequencing run
 - `lane` = identifier of the lane of the sequencing run
 
-Read group information will be parsed from fastq file names according to this:
+Read group information will be parsed from FASTQ file names according to this:
 
 - `RGID` = "sample_lib_flowcell_index_lane"
 - `RGPL` = "Illumina"
@@ -105,7 +106,7 @@ Read group information will be parsed from fastq file names according to this:
 
 ### Providing a TSV file with the paths to uBAM files
 
-The TSV (Tab Separated Values) file for starting the mapping from uBAM files should contain the columns:
+The TSV file for starting the mapping from uBAM files should contain the columns:
 
 - `subject sex status sample lane bam`
 
@@ -133,10 +134,25 @@ G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.bam    pathToFiles/G
 G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.bam    pathToFiles/G15511.D0ENMT.md.bai
 ```
 
-When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked_no_table.tsv` and can be used to restart Sarek from the non-recalibrated BAM files. Setting the step `--step prepare_recalibration` will automatically take this file as input.
+When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process.
+This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked_no_table.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
+Using `--step prepare_recalibration` will automatically take this file as input.
 
 Additionally, individual TSV files for each sample (`duplicates_marked_no_table_[SAMPLE].tsv`) can be found in the same directory.
 
+If `--skip_markduplicates` has been specified, the TSV file for this step will be slightly different:
+
+```text
+G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.bam    pathToFiles/G15511.C09DFN.bai
+G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.bam    pathToFiles/G15511.D0ENMT.bai
+```
+
+When starting Sarek from the mapping step with `--skip_markduplicates`, a TSV file is generated automatically after the `Mapping` processes.
+This TSV file is stored under `results/Preprocessing/TSV/mapped.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
+Using `--step prepare_recalibration --skip_markduplicates` will automatically take this file as input.
+
+Additionally, individual TSV files for each sample (`mapped_[SAMPLE].tsv`) can be found in the same directory.
+
 ## Starting from the BAM recalibration step
 
 To start from the recalibration step (`--step recalibrate`), a TSV file for a normal/tumor pair needs to be given as input containing the paths to the non recalibrated but already mapped BAM files.
@@ -147,14 +163,29 @@ The TSV needs to contain the following columns:
 The same way, if you have non recalibrated BAMs, their indexes and their recalibration tables, you should use a structure like:
 
 ```text
-G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.bam    pathToFiles/G15511.C09DFN.md.bai pathToFiles/G15511.C09DFN.md.recal.table
-G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.bam    pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.md.recal.table
+G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.bam    pathToFiles/G15511.C09DFN.md.bai pathToFiles/G15511.C09DFN.recal.table
+G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.bam    pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.recal.table
 ```
 
-When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files. Setting the step `--step recalibrate` will automatically take this file as input.
+When starting Sarek from the mapping step, a TSV file is generated automatically after the `BaseRecalibrator` processes.
+This TSV file is stored under `results/Preprocessing/TSV/duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
+Using `--step recalibrate` will automatically take this file as input.
 
 Additionally, individual TSV files for each sample (`duplicates_marked_[SAMPLE].tsv`) can be found in the same directory.
 
+If `--skip_markduplicates` has been specified, the TSV file for this step will be slightly different:
+
+```text
+G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.bam    pathToFiles/G15511.C09DFN.bai pathToFiles/G15511.C09DFN.recal.table
+G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.bam    pathToFiles/G15511.D0ENMT.bai pathToFiles/G15511.D0ENMT.recal.table
+```
+
+When starting Sarek from the mapping step with `--skip_markduplicates`, a TSV file is generated automatically after the `BaseRecalibrator` processes.
+This TSV file is stored under `results/Preprocessing/TSV/mapped_no_duplicates_marked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files.
+Using `--step recalibrate --skip_markduplicates` will automatically take this file as input.
+
+Additionally, individual TSV files for each sample (`mapped_no_duplicates_marked_[SAMPLE].tsv`) can be found in the same directory.
+
 ## Starting from the variant calling step
 
 A TSV file for a normal/tumor pair with recalibrated BAM files and their indexes can be provided to start Sarek from the variant calling step (`--step variantcalling`).
@@ -165,11 +196,13 @@ The TSV file should contain the columns:
 Here is an example for two samples from the same subject:
 
 ```text
-G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.md.recal.bam    pathToFiles/G15511.C09DFN.md.recal.bai
-G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.md.recal.bam    pathToFiles/G15511.D0ENMT.md.recal.bai
+G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.recal.bam    pathToFiles/G15511.C09DFN.recal.bai
+G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.recal.bam    pathToFiles/G15511.D0ENMT.recal.bai
 ```
 
-When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes. This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files. Setting the step `--step variantcalling` will automatically take this file as input.
+When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes.
+This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files.
+Using `--step variantcalling` will automatically take this file as input.
 
 Additionally, individual TSV files for each sample (`recalibrated_[SAMPLE].tsv`) can be found in the same directory.
 
@@ -180,15 +213,16 @@ The TSV needs to contain the following columns:
 
 - `subject sex status sample mpileup`
 
-The same way, if you have non recalibrated BAMs and their indexes, you should use a structure like:
+Here is an example for one normal/tumor pair from one subject:
 
 ```text
 G15511    XX    0    C09DFN    pathToFiles/G15511.C09DFN.pileup
 G15511    XX    1    D0ENMT    pathToFiles/G15511.D0ENMT.pileup
 ```
 
 When starting Sarek from the Control-FREEC step, a TSV file is generated automatically after the `mpileup` process.
-This TSV file is stored under `results/VariantCalling/TSV/control-freec_mpileup.tsv` and can be used to restart Sarek from the mpileup files. Setting the step `--step Control-FREEC` will automatically take this file as input.
+This TSV file is stored under `results/VariantCalling/TSV/control-freec_mpileup.tsv` and can be used to restart Sarek from the mpileup files.
+Using `--step Control-FREEC` will automatically take this file as input.
 
 Additionally, individual TSV files for each sample (`control-freec_mpileup_[SAMPLE].tsv`) can be found in the same directory.
 

@@ -28,6 +28,7 @@
   - [--markdup_java_options](#--markdup_java_options)
   - [--no_gatk_spark](#--no_gatk_spark)
   - [--save_bam_mapped](#--save_bam_mapped)
+  - [--skip_markduplicates](#--skip_markduplicates)
 - [Variant Calling](#variant-calling)
   - [--ascat_ploidy](#--ascat_ploidy)
   - [--ascat_purity](#--ascat_purity)
@@ -364,6 +365,10 @@ Use this to disable usage of GATK Spark implementation of their tools in local m
 
 Will save mapped BAMs.
 
+### --skip_markduplicates
+
+Will skip MarkDuplicates. This param will also save the mapped BAMs, to enable restarting from step `prepare_recalibration`.
+
 ## Variant Calling
 
 ### --ascat_ploidy