Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cram support + read splitting with seqkit for speedup #388

Merged
merged 73 commits into from
Jul 15, 2021

Conversation

FriederikeHanssen
Copy link
Contributor

crams-everywhere_meme

BAMQC is still a bit of an open problem, especially after base recalibration....

Important change: BamQC + Samtools Stats is only done ONCE BEFORE BaseRecalibration right now: if duplicates are marked it is run after duplicate marking, if they are not it is run after mapping. This reduces the runtime, since MarkDuplicates can take care of merging split reads internally without runtime punishments.
Spark implementation currently only works with singularity. The docker image has issue, and would have to possibly be rebuild :(

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions
Copy link

github-actions bot commented Jun 15, 2021

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 22d55ab

+| ✅ 135 tests passed       |+
#| ❔  11 tests were ignored |#
!| ❗  84 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: environment.yml
  • files_exist - File not found: Dockerfile
  • nextflow_config - Config variable not found: process.container
  • params_used - Config variable not found in main.nf: params.input
  • params_used - Config variable not found in main.nf: params.step
  • params_used - Config variable not found in main.nf: params.genome
  • params_used - Config variable not found in main.nf: params.genomes_base
  • params_used - Config variable not found in main.nf: params.save_reference
  • params_used - Config variable not found in main.nf: params.help
  • params_used - Config variable not found in main.nf: params.no_intervals
  • params_used - Config variable not found in main.nf: params.nucleotides_per_second
  • params_used - Config variable not found in main.nf: params.sentieon
  • params_used - Config variable not found in main.nf: params.skip_qc
  • params_used - Config variable not found in main.nf: params.target_bed
  • params_used - Config variable not found in main.nf: params.tools
  • params_used - Config variable not found in main.nf: params.trim_fastq
  • params_used - Config variable not found in main.nf: params.clip_r1
  • params_used - Config variable not found in main.nf: params.clip_r2
  • params_used - Config variable not found in main.nf: params.three_prime_clip_r1
  • params_used - Config variable not found in main.nf: params.three_prime_clip_r2
  • params_used - Config variable not found in main.nf: params.trim_nextseq
  • params_used - Config variable not found in main.nf: params.save_trimmed
  • params_used - Config variable not found in main.nf: params.split_fastq
  • params_used - Config variable not found in main.nf: params.aligner
  • params_used - Config variable not found in main.nf: params.markdup_java_options
  • params_used - Config variable not found in main.nf: params.use_gatk_spark
  • params_used - Config variable not found in main.nf: params.save_bam_mapped
  • params_used - Config variable not found in main.nf: params.skip_markduplicates
  • params_used - Config variable not found in main.nf: params.ascat_ploidy
  • params_used - Config variable not found in main.nf: params.ascat_purity
  • params_used - Config variable not found in main.nf: params.cf_coeff
  • params_used - Config variable not found in main.nf: params.cf_contamination
  • params_used - Config variable not found in main.nf: params.cf_contamination_adjustment
  • params_used - Config variable not found in main.nf: params.cf_ploidy
  • params_used - Config variable not found in main.nf: params.cf_window
  • params_used - Config variable not found in main.nf: params.generate_gvcf
  • params_used - Config variable not found in main.nf: params.no_strelka_bp
  • params_used - Config variable not found in main.nf: params.pon
  • params_used - Config variable not found in main.nf: params.pon_index
  • params_used - Config variable not found in main.nf: params.ignore_soft_clipped_bases
  • params_used - Config variable not found in main.nf: params.umi
  • params_used - Config variable not found in main.nf: params.read_structure1
  • params_used - Config variable not found in main.nf: params.read_structure2
  • params_used - Config variable not found in main.nf: params.annotate_tools
  • params_used - Config variable not found in main.nf: params.annotation_cache
  • params_used - Config variable not found in main.nf: params.cadd_cache
  • params_used - Config variable not found in main.nf: params.cadd_indels
  • params_used - Config variable not found in main.nf: params.cadd_indels_tbi
  • params_used - Config variable not found in main.nf: params.cadd_wg_snvs
  • params_used - Config variable not found in main.nf: params.cadd_wg_snvs_tbi
  • params_used - Config variable not found in main.nf: params.genesplicer
  • params_used - Config variable not found in main.nf: params.snpeff_cache
  • params_used - Config variable not found in main.nf: params.config_profile_contact
  • params_used - Config variable not found in main.nf: params.config_profile_description
  • params_used - Config variable not found in main.nf: params.config_profile_url
  • params_used - Config variable not found in main.nf: params.outdir
  • params_used - Config variable not found in main.nf: params.publish_dir_mode
  • params_used - Config variable not found in main.nf: params.sequencing_center
  • params_used - Config variable not found in main.nf: params.multiqc_config
  • params_used - Config variable not found in main.nf: params.monochrome_logs
  • params_used - Config variable not found in main.nf: params.email
  • params_used - Config variable not found in main.nf: params.email_on_fail
  • params_used - Config variable not found in main.nf: params.plaintext_email
  • params_used - Config variable not found in main.nf: params.max_multiqc_email_size
  • params_used - Config variable not found in main.nf: params.hostnames
  • params_used - Config variable not found in main.nf: params.validate_params
  • params_used - Config variable not found in main.nf: params.tracedir
  • params_used - Config variable not found in main.nf: params.enable_conda
  • params_used - Config variable not found in main.nf: params.pull_docker_container
  • params_used - Config variable not found in main.nf: params.cpus
  • params_used - Config variable not found in main.nf: params.max_cpus
  • params_used - Config variable not found in main.nf: params.max_memory
  • params_used - Config variable not found in main.nf: params.max_time
  • actions_awsfulltest - .github/workflows/awsfulltest.yml should test full datasets, not -profile test
  • readme - README did not have a Nextflow minimum version badge.
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in awstest.yml: You can customise CI pipeline run tests as required
  • pipeline_todos - TODO string in freebayes.nf: Named file extensions MUST be emitted for ALL output channels
  • pipeline_todos - TODO string in freebayes.nf: List additional required output channels/values here
  • pipeline_todos - TODO string in main.nf: It MUST be possible to pass additional parameters to the tool as a command-line string via the "$ioptions.args" variable
  • pipeline_todos - TODO string in main.nf: If the tool supports multi-threading then you MUST provide the appropriate parameter
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • schema_description - No description provided in schema for parameter: cpus

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/feature_request.md
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: assets/nf-core-sarek_logo.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-sarek_logo.png
  • files_unchanged - File ignored due to lint config: lib/NfcoreSchema.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or foo
  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_dev.yml
  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_release.yml
  • conda_env_yaml - No environment.yml file found - skipping conda_env_yaml test
  • conda_dockerfile - No environment.yml / Dockerfile file found - skipping conda_dockerfile test

✅ Tests passed:

Run details

  • nf-core/tools version 1.14
  • Run at 2021-07-02 08:20:47

@maxulysse
Copy link
Member

will close #63

@maxulysse
Copy link
Member

@FriederikeHanssen your branch is out-of-date with the base branch due to #390
You might want to sync

@FriederikeHanssen
Copy link
Contributor Author

ups, thanks for the heads up

@FriederikeHanssen
Copy link
Contributor Author

FriederikeHanssen commented Jul 15, 2021

@maxulysse I cleaned up the code a bit now. there are still quite some things that need to be fixed/discussed in separate PRs. I'll add a collection here for context and turn it into cards on the project board:

Things that need to still be fixed from PR #388

  • Adding CSV writing after mapping back in
  • Adding CSV wrirting after duplicate marking back in
  • allow results to be output in bam format
  • Warning if patient/sample are not part of samplesheet
  • Add Merging step of bam files after mapping, if mapping output should be saved
  • fix the channel/file magic that is going on with dbsnp/known indels
  • discuss if we really should have BQSR Spark still in Beta mode. If we keep it, label in big bold letters that it is very experimental
  • better naming for duplicatemarking workflow. Separating it out is tricky in my eyes. But now we are covering two functionalities
  • if skip_markduplicates than map directly to cram and not to bam (see next point too)
  • simplify MDSpark and use cram bamqc container both times. code would be less complicated. Depends on how much longer bamqc would take
  • running multiple tools fails sometimes
  • DeepVariant to cram, modules.conf not passed down
  • Add FREEBayes, TIDDIT, MPILEUP, MANTA for germline VC

The CI tests are not passing anymore, since they are not bams but crams. I don't really know what is going on with the nf-core lint testing. the error message is a bit cryptic to me.

I'll updraft it for you to take a look. then we can see what I should fix here before merge and what can wait for a later PR

@FriederikeHanssen FriederikeHanssen marked this pull request as ready for review July 15, 2021 14:43
Comment on lines +1 to +33
// Import generic module functions
include { initOptions; saveFiles; getSoftwareName } from './functions'

params.options = [:]
options = initOptions(params.options)

process INDEX_TARGET_BED {
tag "$target_bed"
label 'process_medium'
publishDir "${params.outdir}",
mode: params.publish_dir_mode,
saveAs: { filename -> saveFiles(filename:filename, options:params.options, publish_dir:getSoftwareName(task.process), meta:[:], publish_by_meta:[]) }

conda (params.enable_conda ? "bioconda::htslib=1.12" : null)
if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) {
//TODO: No singularity container at the moment, use docker container for the moment
container "quay.io/biocontainers/htslib:1.12--h9093b5e_1"
} else {
container "quay.io/biocontainers/htslib:1.12--hd3b49d5_0"
}

input:
path target_bed

output:
tuple path("${target_bed}.gz"), path("${target_bed}.gz.tbi")

script:
"""
bgzip --threads ${task.cpus} -c ${target_bed} > ${target_bed}.gz
tabix ${target_bed}.gz
"""
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that in nf-core/modules

try {
includeConfig 'conf/base.config'
} catch (Exception e) {
System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/base.config")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note sure about that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, yeah I had problems were the error messsage for failing to load configs was very confusing. But maybe this is also something to deal with upstream and not in sarek

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worry, we'll fix that

try {
includeConfig 'conf/modules.config'
} catch (Exception e) {
System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/modules.config")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member

@maxulysse maxulysse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants