To speed up the variant calling processes, the reference is chopped into smaller pieces. The variant calling is done by this intervals, and the different resulting VCFs are then merged. This can parallelize the variant calling processes, and push down the variant calling wall clock time significantly.
The calling intervals can be defined using a .list
or a .bed
file.
A .list
file contains one interval per line in the format chromosome:start-end
(1-based coordinates).
When the intervals file is in BED format, the file must be a tab-separated text file with one interval per line.
There must be at least three columns: chromosome, start, and end.
In BED format, the coordinates are 0-based, so the interval chrom:1-10
becomes chrom<tab>0<tab>10
.
Additionally, the "score" column of the BED file can be used to provide an estimate of how many seconds it will take to call variants on that interval. The fourth column remains unused. Example (the fields would actually be tab-separated, this is not shown here):
chr1 10000 207666 NA 47.3
This indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds.
The runtime estimate is used in two different ways. First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that needs to be spawned. Second, the jobs with largest processing time are started first, which reduces wall-clock time. If no runtime is given, a time of 1000 nucleotides per second is assumed. Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second.
The file wgs_calling_regions.grch37.list
contains the intervals which are chopped up at the centromeres.
The file wgs_calling_regions_sorted.hg38.list
contains the intervals which are a list of non-N regions, sorted by their size, to enhance wall clock time.
The file wgs_calling_regions.hg38.bed
contains the intervals which are a list of non-N regions in a BED format, with the fifth column containing runtime estimate used to enhance wall clock time.
The file tiny_GRCh37.list
is based on the GRCh37
intervals list file but tailored for the smallReferences
.
The file tiny_GRCh38.list
contains the intervals from the tiny_GRCh37.list
lifted over from GRCh37
to GRCh38
in the following way:
module load bioinfo-tools samtools/1.4 bwa
samtools faidx /sw/data/uppnex/ToolBox/ReferenceAssemblies/hg38make/bundle/2.8/b37/human_g1k_v37_decoy.fasta $(< tiny.list) > intervals.fasta
bwa mem /sw/data/uppnex/ToolBox/hg38bundle/Homo_sapiens_assembly38.fasta intervals.fasta > intervals.sam
grep -v '^@' intervals.sam | awk '{printf("%s:%d-%d\n", $3, $4, $4+$6-1)}' > tiny-GRCh38.list