Changes to annotation files submitted by GOA #185

Closed
tonysawfordebi opened this issue Apr 26, 2016 · 10 comments

@tonysawfordebi
Contributor

We (GOA) currently submit the following files to the GOC SVN repository:

GO-SVN/trunk/gene-associations/submission/

gene_association.goa_<species>.gz - annotations to proteins that are part of the UniProt *complete* proteome, *plus* isoforms, complexes, and RNAs
gene_association.goa_ref_<species>.gz - annotations to canonical accessions from the UniProt gene-centric reference proteome (GCRP)

gene_association.goa_uniprot.gz
gene_association.goa_pdb.gz

GO-SVN/trunk/gpad-gpi/submission/

gp_association.goa_ref_<species>.gz - annotations to canonical accessions from the UniProt gene-centric reference proteome (GCRP)
gp_information.goa_ref_<species>.gz - metadata only for canonical accessions from the GCRP

where <species> is one of: human, chicken, cow, dog, pig

At the recent GOC meeting in Geneva, we announced our intention to revise the set of files that we publish as part of our four-weekly release cycle in order to eliminate confusion that has arisen in the past about what the various files contain.

As of the GOA release that is scheduled for the week of 9th May, we intend to change the set of files that we publish and submit to the GOC repository to the following:

GO-SVN/trunk/gene-associations/submission/

goa_<species>.gaf.gz - annotations to canonical accessions from the UniProt GCRP
goa_<species>_isoform.gaf.gz - annotations to isoforms from the UniProt GCRP
goa_<species>_complex.gaf.gz - annotations to complexes
goa_<species>_rna.gaf.gz - annotations to RNAs

goa_uniprot_all.gaf.gz - renamed gene_association.goa_uniprot.gz; identical content
goa_pdb.gaf.gz - renamed gene_association.goa_pdb.gz; identical content

GO-SVN/trunk/gpad-gpi/submission/

goa_<species>.gpa.gz - annotations to canonical accessions from the UniProt GCRP
goa_<species>_isoform.gpa.gz - annotations to isoforms from the UniProt GCRP
goa_<species>_complex.gpa.gz - annotations to complexes
goa_<species>_rna.gpa.gz - annotations to RNAs
goa_uniprot_all.gpa.gz - renamed gp_association.goa_uniprot.gz; identical content

goa_<species>.gpi.gz - metadata for canonical accessions from the UniProt GCRP
goa_<species>_isoform.gpi.gz - metadata for isoforms from the UniProt GCRP
goa_<species>_complex.gpi.gz - metadata for complexes
goa_<species>_rna.gpi.gz - metadata for RNAs

Again, <species> is one of: human, chicken, cow, dog, pig
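
For anyone updating a pipeline, the proposed names listed above can be enumerated with a short script. This is only a sketch: `proposed_files`, `SPECIES`, `SUBSETS` and the directory constants are hypothetical names chosen for illustration, not part of any GOA or GOC tooling, and it covers the per-species files only (not goa_uniprot_all or goa_pdb).

```python
# Hypothetical helper, not GOA tooling: lists the proposed per-species
# file names exactly as they are spelled in the proposal above.
SPECIES = ("human", "chicken", "cow", "dog", "pig")
SUBSETS = ("", "_isoform", "_complex", "_rna")  # "" is the canonical GCRP set
GAF_DIR = "GO-SVN/trunk/gene-associations/submission/"
GPAD_GPI_DIR = "GO-SVN/trunk/gpad-gpi/submission/"


def proposed_files(species):
    """Return {directory: [file names]} for one species."""
    if species not in SPECIES:
        raise ValueError(f"unknown species: {species}")
    names = {GAF_DIR: [], GPAD_GPI_DIR: []}
    for subset in SUBSETS:
        stem = f"goa_{species}{subset}"
        names[GAF_DIR].append(f"{stem}.gaf.gz")
        names[GPAD_GPI_DIR].append(f"{stem}.gpa.gz")
        names[GPAD_GPI_DIR].append(f"{stem}.gpi.gz")
    return names


if __name__ == "__main__":
    for directory, files in proposed_files("human").items():
        print(directory)
        for name in files:
            print("  " + name)
```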

As you can see, as well as rationalising the contents of the files that we produce, we are also proposing changing their names; we're doing this for two reasons:

  1. it emphasises that the annotation sets in these new files are not the same as those in the current files: the goa_<species> sets will now contain annotations only to proteins that are canonical accessions from the UniProt GCRP for the species, whereas the current files contain annotations to canonical accessions and isoforms from the UniProt complete proteome for the species, plus RNAs and complexes.
  2. it makes more sense for the format of the contents to be indicated by the file suffix rather than prefix
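
One practical consequence of putting the format in the suffix is that a consumer can recover the dataset, subset and format from the file name alone. A minimal sketch, assuming the names follow the proposed pattern exactly; `parse_goa_name` and `GoaFile` are hypothetical names, and file names in the old scheme are simply not recognised.

```python
import re
from typing import NamedTuple, Optional


class GoaFile(NamedTuple):
    dataset: str           # e.g. "human", "uniprot", "pdb"
    subset: Optional[str]  # "isoform", "complex", "rna", "all", or None
    fmt: str               # "gaf", "gpa" or "gpi"


# Pattern for the proposed scheme only: goa_<dataset>[_<subset>].<format>.gz
_NAME_RE = re.compile(
    r"^goa_(?P<dataset>[a-z0-9]+)"
    r"(?:_(?P<subset>isoform|complex|rna|all))?"
    r"\.(?P<fmt>gaf|gpa|gpi)\.gz$"
)


def parse_goa_name(filename: str) -> Optional[GoaFile]:
    """Split a new-style file name into its parts; None if it does not match."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    return GoaFile(match["dataset"], match["subset"], match["fmt"])


if __name__ == "__main__":
    for name in ("goa_human_isoform.gpa.gz", "goa_uniprot_all.gaf.gz",
                 "gene_association.goa_human.gz"):
        print(name, "->", parse_goa_name(name))
```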

Comments / suggestions / objections?

@cmungall
Member

cmungall commented Apr 27, 2016

+1 to the change, provided we have addressed issues here: geneontology/go-annotation#1113 -- specifically, do we lose any annotations that are assigned to a non-GCRP when they should be assigned to a GCRP?

May 9 is only two weeks away. It may take some time for people to change their pipelines. One possibility is for the central GO publishing script (aka Mike's script) to produce both the old scheme and the new scheme to ease the transition. However, I would advocate against this.

@cmungall
Member

Checklist:

  • send advance notification
  • add JSON metadata files (these drive the main website)
  • modify go/software/utilities/filter-gene-association.pl
  • sync changes with legacy CVS repo

@ValWood
Contributor

ValWood commented Apr 27, 2016

+10 !

@ValWood
Contributor

ValWood commented Apr 27, 2016

One thought.

a) I'm assuming that the "isoform file" will contain only the automated annotations on TrEMBL entries? In which case it isn't only isoforms, but also any truncated or alternative entry which isn't integrated into the UniProt record (UniProt may want to correct me here).

b) Sometimes a GO annotator will make an experimental annotation to an isoform via the canonical entry, recording the isoform in column 17 (UniProtKB:P12345-2)... I'm guessing/hoping these will still be in the main file (because they will have the canonical ID in column 2).

So if the above is correct, the file naming/contents description might be confusing if people really are interested in isoform annotations... However, it's probably only a file description issue... the data split described is completely necessary.
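
To make point (b) concrete, here is a minimal sketch of how a consumer could spot isoform annotations inside the main file by reading column 2 (the object ID) and column 17 (the gene product form ID) of a GAF line. The function name `isoform_annotation` is hypothetical, and every value in the sample line other than UniProtKB:P12345-2 is a made-up placeholder, not real annotation data.

```python
from typing import Optional, Tuple


def isoform_annotation(line: str) -> Optional[Tuple[str, str]]:
    """Return (column 2 ID, column 17 form ID) when a GAF line names an isoform.

    Comment lines (starting with "!") and lines with an empty column 17
    return None.
    """
    if line.startswith("!"):
        return None
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # tolerate lines with missing trailing columns
    object_id, form_id = cols[1], cols[16]
    if not form_id:
        return None
    return object_id, form_id


if __name__ == "__main__":
    # Placeholder line: only the column 2 / column 17 pairing matters here.
    sample = "\t".join([
        "UniProtKB", "P12345", "SYMBOL", "", "GO:0005515", "PMID:0000000",
        "IPI", "", "F", "", "", "protein", "taxon:9606", "20160427",
        "UniProt", "", "UniProtKB:P12345-2",
    ])
    print(isoform_annotation(sample))
```

A line like the sample would sit in the main file under the split discussed here, because column 2 carries the canonical accession, which is why the description of the "isoform" file matters.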

@cmungall
Member

point a, isoforms: I just realized I am not clear on what this is. "isoforms from the UniProt GCRP" sounds like a bit of a contradiction in terms, since unless I'm mistaken the GCRP by design avoids both true isoforms and alternate records.

Perhaps this is just the current goa_$species, minus GCRP, RNA and complexes. In which case I would go with a name like "_altids" rather than "_isoforms" (sorry, can't think of a better suggestion now). We should also document a use case for this file. I agree "_isoforms" is potentially confusing if these are alternate records rather than true isoforms.

b - my understanding is the same as yours. And I agree it's confusing to have the altid file called "isoforms" as it would lead people to believe they would not get isoforms from the default file.

@vanaukenk
Contributor

With regard to Tony's point 2:

  2. it makes more sense for the format of the contents to be indicated by the file suffix rather than prefix

I completely agree. Is the plan for this practice to be adopted consortium-wide?

@tonysawfordebi
Contributor Author

In the light of comments about timing, we've decided to postpone rolling out these changes until the GOA release scheduled for the week of 6th June; this gives people an extra four weeks to make any necessary preparations.

@tonysawfordebi
Contributor Author

I've created some sample files for human and mouse; you can find them at ftp://ftp.ebi.ac.uk/pub/contrib/goa/new-files/

@cmungall
Member

cmungall commented Jun 8, 2016

Continued in #190
