Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

classification with full length 16S gene #1000

Closed
ganiatgithub opened this issue May 26, 2020 · 6 comments
Closed

classification with full length 16S gene #1000

ganiatgithub opened this issue May 26, 2020 · 6 comments

Comments

@ganiatgithub
Copy link

Hi,

I'd like to develop a workflow to use sourmash to classify bacteria sequenced for full length 16S gene, from the nanopore sequencing platform. Just wondering if there already some work done in this area?

Kind regards,
Gaofeng

@ganiatgithub
Copy link
Author

ganiatgithub commented May 26, 2020

Hi,

I've been reading the tutorials on LCA. I have a bunch of full length 16S reads (~ 1.5 kb), and I want to use silva138 nr 99 database to classify them. Are the following steps the best way (basically treating 16S genes as "genomes")?

  1. Gete the silva 138 nr 99 reads (full-length 16S genes)
  2. Generate signatures for each 16S gene
  3. Get a taxonomy spreasheet in csv
  4. Build a LCA database in .json with the above inputs
  5. Generate signatures for each query 16S read and use sourmash lca classify for classification.

There are 374,222 sequences in silva release 138, are there ways to avoid generated individual signatures for each 16S gene in steps 1 and 5?

I'd love to hear from you guys.

Cheers,
Gaofeng

@ganiatgithub
Copy link
Author

Hi again,

I did a small trial with silva 138 sequences and is at the step of generating lca database for full-length 16S sequences.

I used such command to generate signatures:

for i in ./ref1/*fa; do sourmash compute ./ref1/$i -o $i.sig -k 21; done

The generated sig file looks like (everything is in one line):

[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"./ref1/AB000106.1.1343.fa","license":"CC0","signatures":[{"num":0,"ksize":31,"seed":42,"max_hash":18446744073709552,"mins":[],"md5sum":"c16a5320fa475530d9583c34fd356ef5","molecule":"dna"}],"version":0.4}]

Then I try to build lca database with:

sourmash lca index -f ref1-taxonomy.csv ref1.lca.json ./ref1/*.sig

There was error regarding the signatures:

== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
examining spreadsheet headers...
** assuming column 'ID' is identifiers in spreadsheet
** assuming column 'Domain' is superkingdom in spreadsheet
5 distinct identities in spreadsheet out of 5 rows.
5 distinct lineages in spreadsheet out of 5 rows.
ERROR: no hash values found - are there any signatures?`

Is this method suitable for 16S gene classification or there's something wrong I did?

Many thanks!

@ctb
Copy link
Contributor

ctb commented May 31, 2020

huh, that's interesting!

two thoughts --

try removing the .sig files and regenerating signatures with sourmash compute --scaled 100 ... -- there may be an error condition that we are failing to flag, where signatures that have not been computed with --scaled are not being loaded!

second, we have had some interest in this in the past - see #548 - and also a postdoc in the lab did some work on it, and concluded that it would work. You may simply need to use a smaller scaled value.

apologies for delay in response & happy to chat more, but let's see if we can get this error fixed.

@ganiatgithub
Copy link
Author

Hi,

I tried with --scale 100 and having the same error:

Commands:
for i in ./ref1/*fasta; do sourmash compute --scaled 100 $i -o $i.sig -k 21; done

sourmash lca index -f ref1-taxonomy.csv ref1.lca.json ./ref1/*sig

Error:

== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

examining spreadsheet headers...
** assuming column 'ID' is identifiers in spreadsheet
** assuming column 'Domain' is superkingdom in spreadsheet
5 distinct identities in spreadsheet out of 5 rows.
5 distinct lineages in spreadsheet out of 5 rows.
ERROR: no hash values found - are there any signatures?

Is the signature file look alright to you? My signatures are everything in one line, but apparently the signatures from the delmont-subsamples-sigs looks different.

[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"./ref1/group_1.fasta","license":"CC0","signatures":[{"num":0,"ksize":21,"seed":42,"max_hash":184467440737095520,"mins":[5873293062244743,6708306589402593,25910513113457246,29049928113056352,43419558076788542,52116337125800102,76848212431295266,86161699444494465,89850708486462310,110143751296994050,120748837069271558,124814579368166673,131109090961182998,139986949992840010,152428826484152745,157313275701646212,183501513726388075],"md5sum":"4d5621999235d72f8f500d92d2865c29","molecule":"dna"}],"version":0.4}]

The sequences look like:

>AB000106.1.1343
GGAATCTGCCCTTGGGTTCGGAATAACGTCTGGAAACGGACGCTAATACCGGATGATGAC
GTAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGAGG
TAAAGGCTCACCAAGGCGACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGG
ACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCG
AAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTT
ACCCGGGATGATAATGACAGTACCGGGAGAATAAGCCCCGGCTAACTCCGTGCCAGCAGC
CGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGG
CGGCGATTTAAGTCAGAGGTGAAAGCCCGGGGCTCAACCCCGGAATAGCCTTTGAGACTG
GATTGCTTGAATCCGGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGAT

The taxonomy csv:

ID,Domain,Phylum,Class,Order,Family,Genus,Species
AB000106.1.1343,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Sphingomonas sp.
HL182401.2.1459,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,unidentified
HG529990.1.1403,Bacteria,Bacteroidota,Bacteroidia,Cytophagales,Cyclobacteriaceae,Algoriphagus,sp. AK58
FW305570.1.1437,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Mitsuaria,unidentified
FW496046.1.1514,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Variovorax,unidentified

Kind regards,
Gaofeng

@ctb
Copy link
Contributor

ctb commented Jun 1, 2020 via email

@ctb
Copy link
Contributor

ctb commented Jun 26, 2021

closing for now.

@ctb ctb closed this as completed Jun 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants