-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid variant dimension in 'genotype/data' and Ploidy=1 with more than 2 alleles per variant #85
Comments
Could you please show me the structure of your gds file? like, f <- seqOpen(gds_filename)
f |
Hi @zhengxwen, Sorry for my delay - I needed to update my email alerts to be sent to my new institutional email address. I should be responsive more quickly now. Here is the output from
|
@zhengxwen just following up on this issue. I’ve tried examining the source code but haven’t been able to identify the root of the error. |
See:
It is expected to be |
Ok I see, thank you very much! I think I see the reason. Our analysis pipeline outputs a gVCF with every genomic position from our molecular inversion probe target region included in the VCF record lines even if there is no variant at each position. In our VCF, at positions without variants, there is no GT field. Each record row which has a variant allele at that position appears to have a properly formatted GT field. Only a small subset of the VCF records have a variant allele and the necessary GT field. Have you come across this before? Is there a way to specify in the If not, perhaps we can get around this by including a simple grep script in the pipeline to remove VCF lines without variants (those lines without the GT field). Here are two screenshots illustrating the different formatting between variant and non-variant record lines: |
Removing VCF lines without variants (those lines without the GT field) would be a good way before importing to GDS. GT: GQ: DP: AD: RO: QR: AO: QA:GL 0/0:127.042:454:454,0:454:15248:0:0:0,-136.668,-1372.15 So I will expect a "corrected" VCF should give you a diploidy GDS file. |
Ok I will proceed with removing the VCF lines without variants. It sounds like SeqArray does not support gVCF to GDS conversion. Is this correct? In other words, SeqArray exclusively supports conversion of regular VCF files and not gVCF files which contain the homozygous reference blocks lacking GT calls which are added by various variant callers (GATK, Freebayes, etc.) during gVCF generation (which contain a line for every position in the genome including numerous non-variant sites)? I think we have a fix to adjust the ploidy in the genotype calls - thank you for pointing that out. |
Not sure how you define gVCF, but the gVCF files I received and processed did have GT for every position (even for the site without variant). Another option might be you add the GT field to every site in your pipeline. |
Hi,
I've used SeqArray::seqVCF2GDS to convert a haploid species VCF v4.2 file to gds format which has variants with multiple alternative alleles due to the nature of our sequencing which involves sequencing combined mixed clonal samples. We require storing any number of alleles per variant position (2+) even though our reference genome is a haploid genome (showing more than 2 alleles is useful to us because it indicates mixed clones of one species in one sample).
I get an error when running various SeqArray functions and I'm wondering if the root cause is that the Ploidy is 1 but there are variants in the GDS file with more than 2 alternative alleles.
When I run
seqSummary(gds_filename)
I get the following output including a warning message which I believe explains why many of the SeqArray functions don't work on our data.Running
seqGetData(gdsfile, "genotype")
outputs:Running
seqGetData(gdsfile, "annotation/format/DS")
One more thing I noticed from the summary output is that under Alleles:
Why are there no ALT alleles recorded?
and for tabulation, why do the alleles with 3, 4, 6, 7, 8, 1 show 0.0%?
Could you help me to identify the cause of these errors and help develop a fix? Do you think the GDS summary: Ploidy=1 while there are greater than 2 alleles is the issue? We would like to implement SeqArray into our R tool suite for molecular inversion probe analysis of mixed polyclonal parasite samples and hope we can get it working.
Thank you very much for your time and help,
George
The text was updated successfully, but these errors were encountered: