I have a large-scale dataset of WXS .g.vcf files; I do not have the original FASTA files. I would like to genotype these .g.vcf files into VCFs, which will be used in downstream applications. The issue is that the .g.vcf files were created with GATK HaplotypeCaller without the "--disable-spanning-event-genotyping" flag set to true. This means that * SNPs are introduced: a "SNP" is reported at any site that an upstream INDEL spans (an INDEL overlapping a SNP). This is a problem because I am not interested in * genotypes (my understanding is that this information is already stored in the upstream deletion), and * is not a standard IUPAC base code, so it triggers errors in many downstream applications.
What is the correct way to get rid of these * ALT alleles?
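Looking at bcftools merge, I noted the -m ** flag, so I tried this first on the VCF after genotyping:
./bcftools merge -m none,\*\* --force-single Genotyped_Sample.vcf.gz -Oz -o merged_Genotyped_Sample.vcf.gz
and then on the original, pre-genotyped g.vcf:
./bcftools merge -m none,\*\* --force-single Sample.g.vcf.gz -Oz -o merged_Sample.g.vcf.gz
Neither successfully removes the * ALT alleles.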
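One way to confirm whether the * alleles survive a given command is to count the records whose ALT column still lists `*`. The following is a minimal sketch, not part of bcftools or GATK: the function names `count_star_alt` and `count_star_alt_in_file` are hypothetical, and it assumes standard tab-separated VCF lines where ALT is the fifth column.

```python
import gzip


def count_star_alt(lines):
    """Count VCF records whose ALT column lists the spanning-deletion allele '*'.

    Symbolic alleles such as <*> or <NON_REF> are deliberately not counted.
    """
    count = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")  # ALT is the 5th VCF column
        if "*" in alts:
            count += 1
    return count


def count_star_alt_in_file(path):
    """Apply the same count to a plain or gzip/bgzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_star_alt(handle)
```

Running `count_star_alt_in_file` on the input and output of each command and comparing the two numbers would show whether anything was actually removed.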
WHY DON'T YOU JUST REMOVE THE SNPS WITH THE * ALT ALLELE?
The only reason I am hesitant to simply remove these variants with grep, awk, or something similar is that I am working with a sequenced trio, which provides variant phasing information. After extracting a single subject from the trio VCF (which contains three subjects: mother, father, and child), I see instances where * ALT variants are present, but the phasing information seems to suggest that they are on the haplotype opposite the upstream INDEL. Here is an example, starting with the variants in the trio VCF:
#CHROM POS REF ALT
chr1 154590147 CCG C
chr1 154590148 CG C
chr1 154590149 G *
chr1 154590149 G C
And then, when I extract a single subject with bcftools query:
#CHROM POS REF ALT GT
chr1 154590148 CG C 0|1
chr1 154590149 G * 1|0
chr1 154590149 G C 0|1
Maybe I can just delete these SNPs, because the * annotation was not built with phasing in mind and this is just an artifact.
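The phasing mismatch in the single-subject example can be checked mechanically. Below is a minimal sketch under stated assumptions: records are already split so each row has one ALT (so allele `1` in GT means "this row's ALT"), genotypes are phased with `|`, and the function names are hypothetical rather than part of any tool. It flags `*` records whose haplotype is not carried by any deletion spanning that position.

```python
def alt_haplotypes(gt):
    """Return the haplotype indices (0 = left of '|') that carry the ALT allele.

    Returns None for unphased genotypes, which have no haplotype assignment.
    """
    if "|" not in gt:
        return None
    return {i for i, allele in enumerate(gt.split("|")) if allele == "1"}


def inconsistent_star_records(records):
    """Flag '*' records not explained by a spanning deletion on the same haplotype.

    records: iterable of (chrom, pos, ref, alt, gt) with one ALT per record.
    """
    records = list(records)

    # First pass: collect the reference interval removed by each phased deletion.
    deletions = []
    for chrom, pos, ref, alt, gt in records:
        haps = alt_haplotypes(gt)
        if haps is None or alt == "*" or len(ref) <= len(alt):
            continue
        deletions.append((chrom, pos + len(alt), pos + len(ref) - 1, haps))

    # Second pass: compare each '*' record against the deletions covering it.
    flagged = []
    for chrom, pos, ref, alt, gt in records:
        haps = alt_haplotypes(gt)
        if alt != "*" or haps is None:
            continue
        covered = set()
        for d_chrom, start, end, d_haps in deletions:
            if d_chrom == chrom and start <= pos <= end:
                covered |= d_haps
        if not haps <= covered:
            flagged.append((chrom, pos, gt))
    return flagged


# The single-subject rows from the example above.
subject = [
    ("chr1", 154590148, "CG", "C", "0|1"),
    ("chr1", 154590149, "G", "*", "1|0"),
    ("chr1", 154590149, "G", "C", "0|1"),
]
print(inconsistent_star_records(subject))  # [('chr1', 154590149, '1|0')]
```

The `*` at 154590149 is flagged because the only spanning deletion in the extract (CG>C at 154590148) sits on the other haplotype. The sketch only sees the records it is given, so a deletion that was dropped when extracting the subject would not be taken into account.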
jon4thin changed the title from "Remove the * ALT Allele Reporting From a g.vcf Before Genotyping" to "Remove the * ALT Allele Reporting From a g.vcf Before (or After) Genotyping" on Aug 14, 2024.