Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rsem-gff3-to-gtf Fails on GFF3 file from NCBI #92

Open
winni2k opened this issue Mar 22, 2018 · 7 comments
Open

rsem-gff3-to-gtf Fails on GFF3 file from NCBI #92

winni2k opened this issue Mar 22, 2018 · 7 comments

Comments

@winni2k
Copy link

winni2k commented Mar 22, 2018

I would like some help understanding why rsem-gff3-to-gtff is failing.

When I run

rsem-gff3-to-gtf data/yeast/GCF_000146045.2_R64_genomic.gff output

then I get the error message

Traceback (most recent call last):
  File "rsem-gff3-to-gtf", line 254, in <module>
    flush_out(fout)
  File "rsem-gff3-to-gtf", line 208, in flush_out
    for start, end in transcript:
TypeError: iter() returned non-iterator of type 'Transcript'

I downloaded the GFF3 file from NCBI at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/latest_assembly_versions/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz

Here is the header of that file:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCF_000146045.2
#!annotation-source SGD R64-2-1
##sequence-region NC_001133.9 1 230218
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292
NC_001133.9     RefSeq  region  1       230218  .       +       .       ID=id0;Dbxref=taxon:559292;Name=I;chromosome=I;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=S288c
NC_001133.9     RefSeq  telomere        1       801     .       -       .       ID=id1;Dbxref=SGD:S000028862;Note=TEL01L%3B Telomeric region on the left arm of Chromosome I%3B composed of an X element core sequence%2C X element combinatorial repeats%2C and a short terminal stretch of telomeric repeats;gbkey=telomere

When I run

rsem-gff3-to-gtf --version

Then I get this output:

rsem-gff3-to-gtf: error: the following arguments are required: input_GFF3_file, output_GTF_file

usage: rsem-gff3-to-gtf [-h] [--RNA-patterns <patterns>]
                        [--extract-sequences <output.fa>]
                        input_GFF3_file output_GTF_file

Convert GFF3 files to GTF files.

positional arguments:
  input_GFF3_file       Input GFF3 file.
  output_GTF_file       Output GTF file.

optional arguments:
  -h, --help            show this help message and exit
  --RNA-patterns <patterns>
                        Types of RNAs to be extracted, e.g. mRNA,rRNA
                        (default: None)
  --extract-sequences <output.fa>
                        If GFF3 file contains reference sequences, extract
                        them to the specified file (default: None)
@winni2k
Copy link
Author

winni2k commented Mar 22, 2018

According to conda I have version 1.3.0 installed:

$ conda list | grep rsem
rsem                      1.3.0               boost1.64_3    bioconda

@winni2k
Copy link
Author

winni2k commented Mar 22, 2018

Version 1.2.28 runs without throwing an exception, and the output file is not empty.

@EidrianGM
Copy link

EidrianGM commented Mar 12, 2019

The same happens to me when running:

rsem-gff3-to-gtf GCF_000001405.38_GRCh38.p12_genomic.gff human_refseq.gtf
[...]
Loaded 3500000 lines
Loaded 3600000 lines
Cannot recognize transcript rna3's parent rna2, a gene feature might be missing.
[...] failed! Plase check if you provide correct parameters/options for the pipeline!

I tried using the alternative gffread but then rsem finds corrupted the output gtf when trying to create the reference genome:

The GTF file might be corrupted!
Stop at line : NC_000001.11 BestRefSeq exon 17369 17391 . - . transcript_id "rna3";
Error Message: Cannot find gene_id!

@winni2k
Copy link
Author

winni2k commented Mar 12, 2019

I'm not sure this helps, but I have abandoned trying to convert GFFs to GTFs. Instead I now just use the files provided by ENSEMBL at ftp://ftp.ensembl.org/pub/release-95 (They come with versions!).

@EidrianGM
Copy link

EidrianGM commented Mar 12, 2019

Thanks for your quick reply @winni2k ... I think you are right, it will be for the best to use directly ensembl genome and gtf. I am a little obfuscated with this issue since I wanted to use the latest ncbi Hs reference GRCh38.p12 (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.38_GRCh38.p12) though I guess there shouldn't be much different.

If it helps to someboy RSEM could successfully use the gtf outputed by gffread specifying the flag -C (coding only discard mRNAs that have no CDS features) when transforming the ncbi gff:

gffread GCF_000001405.38_GRCh38.p12_genomic.gff -E -T -C -o GCF_000001405.38_GRCh38.p12_genomic.gtf

@winni2k
Copy link
Author

winni2k commented Mar 12, 2019

Alas, the mismatching reference genome is an issue. I did end up switching to the Ensembl reference genome as well for my analyses :/

@EidrianGM
Copy link

Now It seems that I have made possible to use ncbi gff and reference withoutproblems, It just needed to have the flag --gff3-RNA-patterns mRNA,rRNA

This was my command:

rsem-prepare-reference --gff3 GCF_000001405.38_GRCh38.p12_genomic.gff --gff3-RNA-patterns mRNA,rRNA --trusted-sources BestRefSeq,Curated\ Genomic --star --star-path /SeqTools/STAR-2.6.0a/bin/Linux_x86_64 GCF_000001405.38_GRCh38.p12_primassembly.fna ./human_refseq

I hope it suits you well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants