Remove the * ALT Allele Reporting From a g.vcf Before (or After) Genotyping #2258

Open
jon4thin opened this issue Aug 14, 2024 · 1 comment


jon4thin commented Aug 14, 2024

I have a large-scale dataset of WXS .g.vcf files, and I do not have the original FASTA files. I would like to genotype these .g.vcf files into VCFs that will be used in downstream applications. The issue is that the g.vcfs were created with GATK HaplotypeCaller without the "--disable-spanning-event-genotyping" flag set to true. As a result, * alleles are introduced: a "SNP" is reported at any site that an upstream INDEL spans (an INDEL overlapping a SNP). This is a problem because I am not interested in * genotypes (my understanding is that this information is already stored in the upstream deletion), and because * is not a standard IUPAC base code, it triggers errors in many downstream applications.

What is the correct way to stop these * ALT alleles from being reported?

Looking at bcftools merge, I noticed the ** modifier of the -m option, so I tried it first on the VCF after genotyping:

./bcftools merge -m none,** --force-single Genotyped_Sample.vcf.gz -Oz -o merged_Genotyped_Sample.vcf.gz

and then on the original, pre-genotyping g.vcf:

./bcftools merge -m none,** --force-single Sample.g.vcf.gz -Oz -o merged_Sample.g.vcf.gz

Neither command removed the * ALT alleles.

WHY DON'T YOU JUST REMOVE THE SNPS WITH THE * ALT ALLELE?

The only reason I am hesitant to simply remove these variants with grep, awk, or similar is that I am working with a sequenced trio, which carries variant phasing information. After I extract a single subject from the trio VCF (which contains three subjects: mother, father, and child), I see instances where * ALT variants are present, but the phasing suggests that they are on the haplotype opposite the upstream INDEL. Here is an example, starting with the variants in the trio VCF:

#CHROM  POS      REF     ALT
chr1    154590147  CCG     C
chr1    154590148  CG      C
chr1    154590149  G       *
chr1    154590149  G       C

and then, when I extract a single subject with bcftools query:

#CHROM  POS         REF   ALT     GT
chr1     154590148   CG  C      0|1
chr1     154590149   G   *      1|0
chr1     154590149   G   C      0|1

Maybe I can just delete these SNPs, because the * annotation was not built with phasing in mind and this is just an artifact.
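The haplotype mismatch described above can be checked mechanically. The following is only a rough sketch under stated assumptions: the `Record` type and the function names are hypothetical helpers written for illustration (they are not part of bcftools or GATK), the records are assumed to be already parsed from a single-sample, phased VCF, and a deletion is recognised simply as a non-symbolic ALT shorter than REF. It reports, for every * allele, which haplotype carries it and which haplotypes carry an upstream deletion that spans the site.

```python
from typing import List, NamedTuple, Set, Tuple


class Record(NamedTuple):
    """Hypothetical, minimal view of one single-sample VCF record."""
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]
    gt: str  # phased genotype string such as "0|1"


def haplotypes_carrying(rec: Record, allele_index: int) -> Set[int]:
    """Haplotype slots (0 = left of '|', 1 = right) that carry the given allele."""
    if "|" not in rec.gt:
        return set()  # unphased genotype: nothing to compare
    return {i for i, a in enumerate(rec.gt.split("|")) if a == str(allele_index)}


def check_star_phasing(records: List[Record]) -> List[Tuple[int, Set[int], Set[int], bool]]:
    """For each '*' allele, compare its haplotype with those of spanning upstream deletions."""
    findings = []
    for rec in records:
        if "*" not in rec.alts:
            continue
        star_haps = haplotypes_carrying(rec, rec.alts.index("*") + 1)
        del_haps: Set[int] = set()
        for other in records:
            if other.chrom != rec.chrom or other.pos >= rec.pos:
                continue  # only records that start upstream can span this site
            if other.pos + len(other.ref) - 1 < rec.pos:
                continue  # REF ends before the '*' position
            for k, alt in enumerate(other.alts, start=1):
                # Assumption: a deletion is a plain ALT shorter than REF
                if alt != "*" and not alt.startswith("<") and len(alt) < len(other.ref):
                    del_haps |= haplotypes_carrying(other, k)
        # Consistent if every haplotype carrying '*' also carries a spanning deletion
        findings.append((rec.pos, star_haps, del_haps, star_haps <= del_haps))
    return findings


# The three single-subject records quoted in this post
example = [
    Record("chr1", 154590148, "CG", ("C",), "0|1"),
    Record("chr1", 154590149, "G", ("*",), "1|0"),
    Record("chr1", 154590149, "G", ("C",), "0|1"),
]

for pos, star_haps, del_haps, consistent in check_star_phasing(example):
    print(pos, "star on", star_haps, "deletion on", del_haps, "consistent:", consistent)
```

For the records above, the * allele sits in the first haplotype slot while the spanning CG>C deletion sits in the second, so the check flags the site as inconsistent. Note that the 154590147 CCG>C deletion from the trio VCF is not among the extracted records, so this sketch cannot say whether that deletion would account for the * allele.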

jon4thin changed the title from "Remove the * ALT Allele Reporting From a g.vcf Before Genotyping" to "Remove the * ALT Allele Reporting From a g.vcf Before (or After) Genotyping" on Aug 14, 2024
Member

pd3 commented Sep 9, 2024

If the spanning star allele is the only ALT, then simply removing the record would work. However, that would not work for cases like this, where the star allele shares the record with another ALT allele:

#CHROM  POS         REF   ALT     GT
chr1     154590148   CG  C      0|1
chr1     154590149   G   A,*    1|2
chr1     154590149   G   C      0|1
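The distinction drawn here can be illustrated with a small text-level sketch. It is an assumption-laden illustration, not a bcftools feature: the function names are hypothetical, the input is assumed to be uncompressed, tab-separated VCF lines with ALT in the fifth column, and the ID, QUAL, FILTER and INFO values in the sample lines are placeholders. Records whose only ALT is * are dropped; records that mix * with another ALT are kept untouched and reported separately, because removing the * there would also require renumbering the genotype indices, which this sketch does not attempt.

```python
from typing import Iterable, List, Tuple


def classify_star_record(line: str) -> str:
    """Classify one VCF data line by how the '*' allele appears in ALT."""
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # ALT is the fifth VCF column
    if "*" not in alts:
        return "no_star"
    if alts == ["*"]:
        return "star_only"
    return "star_with_other_alt"


def drop_star_only(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Drop records whose only ALT is '*'; collect mixed records for manual review."""
    kept: List[str] = []
    needs_review: List[str] = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        kind = classify_star_record(line)
        if kind == "star_only":
            continue  # safe to remove the whole record
        if kind == "star_with_other_alt":
            needs_review.append(line)  # kept as-is: GT indices would need rewriting
        kept.append(line)
    return kept, needs_review


# Sample lines built from the records in this thread (placeholder ID/QUAL/FILTER/INFO)
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t154590148\t.\tCG\tC\t.\t.\t.\tGT\t0|1",
    "chr1\t154590149\t.\tG\t*\t.\t.\t.\tGT\t1|0",
    "chr1\t154590149\t.\tG\tA,*\t.\t.\t.\tGT\t1|2",
]

kept, needs_review = drop_star_only(vcf_lines)
print(len(kept), "lines kept;", len(needs_review), "need review")
```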
