Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using COUNT with bcftools filter across all samples #2363

Open
mhoban opened this issue Feb 1, 2025 · 2 comments
Open

Using COUNT with bcftools filter across all samples #2363

mhoban opened this issue Feb 1, 2025 · 2 comments

Comments

@mhoban
Copy link

mhoban commented Feb 1, 2025

Is there a way to use bcftools filter with COUNT to return only positions with a certain number of values for a given FORMAT entry across all samples?

For example, I'd like to filter out positions where the number of all AD values is exactly two. I can do this for one sample, like this:

bcftools filter -i 'COUNT(AD[0:*]) == 2' input.vcf.gz > filtered.vcf

Is there a way to do this for all samples? Or do I have to enter each position manually? I've tried various permutations of the indexing syntax but it doesn't quite seem to do what I need.

thanks!

@pd3
Copy link
Member

pd3 commented Feb 20, 2025

I think that should be possible using the reverse logic of -e, for example

bcftools filter -e 'COUNT(AD[0:*]) != 2'

@pd3 pd3 closed this as completed Feb 20, 2025
@mhoban
Copy link
Author

mhoban commented Feb 20, 2025

This issue was closed so I'm not sure if my comment will be seen, but your suggestion doesn't work for me. Using the syntax you gave, it only filters against the first sample (sample 0).

I'm looking for a filter statement that will match the condition against ALL samples.

Consider the following, in which I run your suggested filter, query for REF, ALT, and AD values, then grep for cases where COUNT(AD) is greater than 2:

bcftools filter -e 'COUNT(AD[0:*]) != 2' test.vcf.gz | bcftools query -f '%REF,%ALT[\t%AD]' | ggrep -E '[0-9]+,[0-9]+,[0-9]+' | head -5

Which results in this (shown as a table to make it easier to read):

REF/ALT sample1_AD sample2_AD sample3_AD sample4_AD sample5_AD sample6_AD sample7_AD sample8_AD ample9_AD sample10_AD
G,A 2,0 4,1 9,0 1,0 1,0,0 9,0 4,0 4,2 6,0 1,0
G,A 4,0 2,0 2,0 1,0,0,0,0,0 7,6 4,0 2,0 4,0 4,0 10,4
A,G 4,0 5,0 4,0 4,0 8,0 4,0 10,4 10,0 10,0 6,0,4
C,T 4,0 6,0 4,0 1,0,0 6,6 7,0 2,0 3,2 2,2 2,2
T,C 2,0 0,2 0,1 6,0 1,0,0 8,1 1,0 8,0 1,2 0,1

You can see that for each of these rows, there's at least one sample that has more than two AD values (but it's never sample1). This makes sense, because the syntax of the filter expression you suggested only checks the first sample.

My ultimate goal is to retain sites where ALL samples have the same number of AD values as the total number of REF and ALT alleles, something like this (which doesn't work as written, but seems like it should given the expression syntax documentation):

bcftools filter -i 'COUNT(AD[*:*]) == (COUNT(REF) + COUNT(ALT))'

It's easy enough for me to generate something like this, which checks each sample individually:

bcftools filter -i '(COUNT(AD[0:*]) == (COUNT(REF) + COUNT(ALT))) & (COUNT(AD[1:*]) == (COUNT(REF) + COUNT(ALT))) & (COUNT(AD[2:*]) == (COUNT(REF) + COUNT(ALT))) & ...'

But I was hoping there was a quicker and more elegant way to do it.

@pd3 pd3 reopened this Feb 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants