Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools missing lower frequency variants? #1523

Closed
charlesfoster opened this issue Jul 7, 2021 · 5 comments
Closed

bcftools missing lower frequency variants? #1523

charlesfoster opened this issue Jul 7, 2021 · 5 comments

Comments

@charlesfoster
Copy link

Hi,
Instead of providing the files in #1459, I thought I'd create a new issue since I found a related problem. In #1459, I described that an indel with relatively high coverage was not being recovered by bcftools mpileup | bcftools call. The deletion occurs in ~26% of reads:

image

The deletion is picked up by a combination of samtools mpileup | ivar, but is not picked up by bcftools mpileup | bcftools call.

Commands:

samtools mpileup -A -d 0 -B -Q 0 --reference $REF $BAM | ivar variants -p $PREFIX -q 20 -m 10 -t 0.1 -r $REF
bcftools mpileup --count-orphans --no-BAQ --max-idepth 1000000 -d 1000000 -Q 0 \
-Ou --min-BQ 20 \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
-f $REF $BAM | \
bcftools call --ploidy 1 --keep-alts --keep-masked-ref --multiallelic-caller --variants-only -Ou --threads 20 | \
bcftools +fill-tags - -- -t FORMAT/VAF |\
bcftools norm -Ou -a -m - | \
bcftools view -i "FORMAT/VAF >= 0.1 || INFO/IMF >= 0.1" -Oz -o ${PREFIX}.bcftools.vcf.gz -
tabix -p vcf ${PREFIX}.bcftools.vcf.gz

I've attached an example BAM file (downsampled from the original) in which ivar picks up the 17 base deletion occurring at a variant allele frequency of ~0.19 (19%), but bcftools does not call the variant.

Similarly, I've found that well supported variants occurring at a frequencies ~0.30 and lower in other samples are missed by bcftools. I've attached an example BAM file where samtools + ivar picks up the following low frequency variants (among other higher freq variants) that are missed by bcftools:

REGION POS REF ALT ALT_FREQ TOTAL_DP
NC_045512.2 11744 G A 0.103604 222
NC_045512.2 28473 C T 0.118644 59

Visualisation of the BAM file in igv shows that the variants appear 'real', e.g.:

image

The commands are the same as above. Any idea what might be going wrong, or how I should change the command(s)? I'd like to switch to bcftools for everything, but I also need to guarantee that I won't be missing lower frequency variants. The reference genome is: https://www.ncbi.nlm.nih.gov/nuccore/1798174254

Thanks!

The two necessary BAM files are in:
example_data.zip

@pd3
Copy link
Member

pd3 commented Jul 8, 2021

What version of bcftools are you using? I just tested with the latest github version and see it detects the deletion, even with default parameters:

bcftools mpileup -a AD -f ref.fa sample_17_deletion.bam  -t NC_045512.2:27606
NC_045512.2	27606	.	AAAACACGTCTATCAGTTA	AA	120.671	.	INDEL;IDV=16;IMF=0.186047;DP=86;VDB=0.988006;SGB=-0.688148;RPBZ=-1.4103;SCBZ=-0.478091;FS=0;MQ0F=0;AC=1;AN=2;DP4=31,36,4,11;MQ=60	GT:PL:AD	0/1:155,0,255:67,15

Similarly with the two SNVs:

bcftools mpileup -a AD -f ref.fa sample_lowfreqvariants.bam  -t NC_045512.2:11744,NC_045512.2:28473
NC_045512.2	11744	.	G	A	45.2732	.	DP=286;VDB=0.348675;SGB=-0.692352;RPBZ=-1.05967;BQBZ=0.682492;SCBZ=0.703965;FS=0;MQ0F=0;AC=1;AN=2;DP4=88,89,11,10;MQ=60	GT:PL:AD	0/1:82,0,255:177,21
NC_045512.2	28473	.	C	T	104.02	.	DP=88;VDB=0.768686;SGB=-0.636426;RPBZ=0.0119403;BQBZ=1.52589;SCBZ=0.896487;FS=0;MQ0F=0;AC=1;AN=2;DP4=27,24,5,2;MQ=60	GT:PL:AD	0/1:140,0,255:51,7

@charlesfoster
Copy link
Author

I'm using the latest github version:

bcftools 1.12-89-g186dc93
Using htslib 1.12-57-g09255e6

If I just use mpileup with either the command I posted originally, or with the minimal command you provided above, I do see the low frequency variants. I use the two low-freq variants + one high-freq variant to demonstrate.

Command:

bcftools mpileup --count-orphans --no-BAQ --max-idepth 1000000 -d 1000000 -Q 0 \
-Ov --min-BQ 20 \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
-t NC_045512.2:11744,NC_045512.2:28472,NC_045512.2:28473 \
-f $REF $BAM | \
bcftools +fill-tags - -- -t FORMAT/VAF

Truncated output:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample_lowfreqvariants.bam
NC_045512.2	11744	.	G	A,<*>	0	.	DP=345;ADF=94,11,0;ADR=105,12,0;AD=199,23,0;I16=94,105,11,12,9856,513154,1158,61136,11940,716400,1380,82800,4113,96743,521,12753;QS=0.894861,0.105139,0;VDB=0.775562;SGB=-0.692717;RPBZ=-0.641355;BQBZ=0.607498;SCBZ=0.727078;FS=0;MQ0F=0	PL:DP:SP:ADF:ADR:AD:VAF	73,0,255,255,255,255:222:0:94,11,0:105,12,0:199,23,0:0.103604,0
NC_045512.2	28472	.	C	T,<*>	0	.	DP=89;ADF=0,33,0;ADR=0,26,0;AD=0,59,0;I16=0,0,33,26,0,0,2895,149691,0,0,3540,212400,0,0,1325,31957;QS=0,1,0;VDB=0.732855;SGB=-0.693147;FS=0;MQ0F=0	PL:DP:SP:ADF:ADR:AD:VAF	255,178,0,255,178,255:59:0:0,33,0:0,26,0:0,59,0:1,0
NC_045512.2	28473	.	C	T,<*>	0	.	DP=89;ADF=28,5,0;ADR=24,2,0;AD=52,7,0;I16=28,24,5,2,2521,128633,378,21042,3120,187200,420,25200,1151,27729,170,4150;QS=0.86961,0.13039,0;VDB=0.768686;SGB=-0.636426;RPBZ=0.0234517;BQBZ=1.58201;SCBZ=0.679352;FS=0;MQ0F=0	PL:DP:SP:ADF:ADR:AD:VAF	137,0,255,255,255,255:59:3:28,5,0:24,2,0:52,7,0:0.118644,0

I would hope that each of these variants would make it into the final vcf file. However, the first and third one are lost after bcftools call:

bcftools mpileup --count-orphans --no-BAQ --max-idepth 1000000 -d 1000000 -Q 0 -Ov --min-BQ 20 --annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR -t NC_045512.2:11744,NC_045512.2:28472,NC_045512.2:28473 -f $REF $BAM | bcftools call --ploidy 1 --keep-alts --keep-masked-ref --multiallelic-caller --variants-only -Ov --threads 20 | bcftools +fill-tags - -- -t FORMAT/VAF

Output:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample_lowfreqvariants.bam
NC_045512.2	28472	.	C	T	225.417	.	DP=89;ADF=0,33;ADR=0,26;AD=0,59;VDB=0.732855;SGB=-0.693147;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,33,26;MQ=60	GT:PL:DP:SP:ADF:ADR:AD:VAF	1:255,0:59:0:0,33:0,26:0,59:1

If I use the same command but omit --variants-only:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample_lowfreqvariants.bam
NC_045512.2	11744	.	G	A	30.0732	.	DP=345;ADF=94,11;ADR=105,12;AD=199,23;VDB=0.775562;SGB=-0.692717;RPBZ=-0.641355;BQBZ=0.607498;SCBZ=0.727078;FS=0;MQ0F=0;AC=0;AN=1;DP4=94,105,11,12;MQ=60	GT:PL:DP:SP:ADF:ADR:AD:VAF	0:73,255:222:0:94,11:105,12:199,23:0.103604
NC_045512.2	28472	.	C	T	225.417	.	DP=89;ADF=0,33;ADR=0,26;AD=0,59;VDB=0.732855;SGB=-0.693147;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,33,26;MQ=60	GT:PL:DP:SP:ADF:ADR:AD:VAF	1:255,0:59:0:0,33:0,26:0,59:1
NC_045512.2	28473	.	C	T	30.1974	.	DP=89;ADF=28,5;ADR=24,2;AD=52,7;VDB=0.768686;SGB=-0.636426;RPBZ=0.0234517;BQBZ=1.58201;SCBZ=0.679352;FS=0;MQ0F=0;AC=0;AN=1;DP4=28,24,5,2;MQ=60	GT:PL:DP:SP:ADF:ADR:AD:VAF	0:137,255:59:3:28,5:24,2:52,7:0.118644

The two variants that seem real are being removed during bcftools call, presumably (based on the output) because bcftools considers them low quality? I can see that they have the following extra info tags:

NC_045512.2	11744	.	G	A RPBZ=-0.641355 BQBZ=0.607498 SCBZ=0.727078 
NC_045512.2	28473	.	C	T RPBZ=0.0234517 BQBZ=1.58201 SCBZ=0.679352 

@pd3
Copy link
Member

pd3 commented Jul 9, 2021

The problem is in the --ploidy 1 option. You are telling the program that there is only one expected allele at a position and if there are more, they should be considered sequencing and alignment errors. Why would you ever expect a low frequency variant to be considered the major allele?

@charlesfoster
Copy link
Author

Thanks for pointing that out. I’m working with viral data (SARS-CoV-2), and we expect there to potentially be intra-host diversity. The hope is to capture the intra-host diversity in a VCF file, if it exists. Is this not going to be possible when specifying the ploidy?

@pd3
Copy link
Member

pd3 commented Jul 11, 2021

By setting ploidy to 1, the call algorithm is told to look for a single allele. The default ploidy 2 is able to detect maximum two alleles. It seems that what you want to is to process the mpileup input directly, writing your own caller that models your experimental conditions. You can also take a look at freebayes that can work with bigger ploidies.

I will close the issue now as it is not a bug.

@pd3 pd3 closed this as completed Jul 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants