Using COUNT with bcftools filter across all samples #2363

mhoban · 2025-02-01T23:09:50Z

Is there a way to use bcftools filter with COUNT to return only positions with a certain number of values for a given FORMAT entry across all samples?

For example, I'd like to filter out positions where the number of all AD values is exactly two. I can do this for one sample, like this:

bcftools filter -i 'COUNT(AD[0:*]) == 2' input.vcf.gz > filtered.vcf

Is there a way to do this for all samples? Or do I have to enter each position manually? I've tried various permutations of the indexing syntax but it doesn't quite seem to do what I need.

thanks!

The text was updated successfully, but these errors were encountered:

pd3 · 2025-02-20T14:17:08Z

I think that should be possible using the reverse logic of -e, for example

bcftools filter -e 'COUNT(AD[0:*]) != 2'

mhoban · 2025-02-20T23:54:43Z

This issue was closed so I'm not sure if my comment will be seen, but your suggestion doesn't work for me. Using the syntax you gave, it only filters against the first sample (sample 0).

I'm looking for a filter statement that will match the condition against ALL samples.

Consider the following, in which I run your suggested filter, query for REF, ALT, and AD values, then grep for cases where COUNT(AD) is greater than 2:

bcftools filter -e 'COUNT(AD[0:*]) != 2' test.vcf.gz | bcftools query -f '%REF,%ALT[\t%AD]' | ggrep -E '[0-9]+,[0-9]+,[0-9]+' | head -5

Which results in this (shown as a table to make it easier to read):

REF/ALT	sample1_AD	sample2_AD	sample3_AD	sample4_AD	sample5_AD	sample6_AD	sample7_AD	sample8_AD	ample9_AD	sample10_AD
G,A	2,0	4,1	9,0	1,0	1,0,0	9,0	4,0	4,2	6,0	1,0
G,A	4,0	2,0	2,0	1,0,0,0,0,0	7,6	4,0	2,0	4,0	4,0	10,4
A,G	4,0	5,0	4,0	4,0	8,0	4,0	10,4	10,0	10,0	6,0,4
C,T	4,0	6,0	4,0	1,0,0	6,6	7,0	2,0	3,2	2,2	2,2
T,C	2,0	0,2	0,1	6,0	1,0,0	8,1	1,0	8,0	1,2	0,1

You can see that for each of these rows, there's at least one sample that has more than two AD values (but it's never sample1). This makes sense, because the syntax of the filter expression you suggested only checks the first sample.

My ultimate goal is to retain sites where ALL samples have the same number of AD values as the total number of REF and ALT alleles, something like this (which doesn't work as written, but seems like it should given the expression syntax documentation):

bcftools filter -i 'COUNT(AD[*:*]) == (COUNT(REF) + COUNT(ALT))'

It's easy enough for me to generate something like this, which checks each sample individually:

bcftools filter -i '(COUNT(AD[0:*]) == (COUNT(REF) + COUNT(ALT))) & (COUNT(AD[1:*]) == (COUNT(REF) + COUNT(ALT))) & (COUNT(AD[2:*]) == (COUNT(REF) + COUNT(ALT))) & ...'

But I was hoping there was a quicker and more elegant way to do it.

pd3 closed this as completed Feb 20, 2025

pd3 reopened this Feb 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using COUNT with bcftools filter across all samples #2363

Using COUNT with bcftools filter across all samples #2363

mhoban commented Feb 1, 2025

pd3 commented Feb 20, 2025

mhoban commented Feb 20, 2025

Using COUNT with bcftools filter across all samples #2363

Using COUNT with bcftools filter across all samples #2363

Comments

mhoban commented Feb 1, 2025

pd3 commented Feb 20, 2025

mhoban commented Feb 20, 2025