classification with full length 16S gene #1000

ganiatgithub · 2020-05-26T01:55:14Z

Hi,

I'd like to develop a workflow to use sourmash to classify bacteria sequenced for full length 16S gene, from the nanopore sequencing platform. Just wondering if there already some work done in this area?

Kind regards,
Gaofeng

ganiatgithub · 2020-05-26T05:36:52Z

Hi,

I've been reading the tutorials on LCA. I have a bunch of full length 16S reads (~ 1.5 kb), and I want to use silva138 nr 99 database to classify them. Are the following steps the best way (basically treating 16S genes as "genomes")?

Gete the silva 138 nr 99 reads (full-length 16S genes)
Generate signatures for each 16S gene
Get a taxonomy spreasheet in csv
Build a LCA database in .json with the above inputs
Generate signatures for each query 16S read and use sourmash lca classify for classification.

There are 374,222 sequences in silva release 138, are there ways to avoid generated individual signatures for each 16S gene in steps 1 and 5?

I'd love to hear from you guys.

Cheers,
Gaofeng

ganiatgithub · 2020-05-26T07:26:10Z

Hi again,

I did a small trial with silva 138 sequences and is at the step of generating lca database for full-length 16S sequences.

I used such command to generate signatures:

for i in ./ref1/*fa; do sourmash compute ./ref1/$i -o $i.sig -k 21; done

The generated sig file looks like (everything is in one line):

[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"./ref1/AB000106.1.1343.fa","license":"CC0","signatures":[{"num":0,"ksize":31,"seed":42,"max_hash":18446744073709552,"mins":[],"md5sum":"c16a5320fa475530d9583c34fd356ef5","molecule":"dna"}],"version":0.4}]

Then I try to build lca database with:

sourmash lca index -f ref1-taxonomy.csv ref1.lca.json ./ref1/*.sig

There was error regarding the signatures:

== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
examining spreadsheet headers...
** assuming column 'ID' is identifiers in spreadsheet
** assuming column 'Domain' is superkingdom in spreadsheet
5 distinct identities in spreadsheet out of 5 rows.
5 distinct lineages in spreadsheet out of 5 rows.
ERROR: no hash values found - are there any signatures?`

Is this method suitable for 16S gene classification or there's something wrong I did?

Many thanks!

ctb · 2020-05-31T16:15:01Z

huh, that's interesting!

two thoughts --

try removing the .sig files and regenerating signatures with sourmash compute --scaled 100 ... -- there may be an error condition that we are failing to flag, where signatures that have not been computed with --scaled are not being loaded!

second, we have had some interest in this in the past - see #548 - and also a postdoc in the lab did some work on it, and concluded that it would work. You may simply need to use a smaller scaled value.

apologies for delay in response & happy to chat more, but let's see if we can get this error fixed.

ganiatgithub · 2020-06-01T06:44:47Z

Hi,

I tried with --scale 100 and having the same error:

Commands:
for i in ./ref1/*fasta; do sourmash compute --scaled 100 $i -o $i.sig -k 21; done

sourmash lca index -f ref1-taxonomy.csv ref1.lca.json ./ref1/*sig

Error:

== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

examining spreadsheet headers...
** assuming column 'ID' is identifiers in spreadsheet
** assuming column 'Domain' is superkingdom in spreadsheet
5 distinct identities in spreadsheet out of 5 rows.
5 distinct lineages in spreadsheet out of 5 rows.
ERROR: no hash values found - are there any signatures?

Is the signature file look alright to you? My signatures are everything in one line, but apparently the signatures from the delmont-subsamples-sigs looks different.

[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"./ref1/group_1.fasta","license":"CC0","signatures":[{"num":0,"ksize":21,"seed":42,"max_hash":184467440737095520,"mins":[5873293062244743,6708306589402593,25910513113457246,29049928113056352,43419558076788542,52116337125800102,76848212431295266,86161699444494465,89850708486462310,110143751296994050,120748837069271558,124814579368166673,131109090961182998,139986949992840010,152428826484152745,157313275701646212,183501513726388075],"md5sum":"4d5621999235d72f8f500d92d2865c29","molecule":"dna"}],"version":0.4}]

The sequences look like:

>AB000106.1.1343
GGAATCTGCCCTTGGGTTCGGAATAACGTCTGGAAACGGACGCTAATACCGGATGATGAC
GTAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGAGG
TAAAGGCTCACCAAGGCGACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGG
ACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCG
AAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTT
ACCCGGGATGATAATGACAGTACCGGGAGAATAAGCCCCGGCTAACTCCGTGCCAGCAGC
CGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGG
CGGCGATTTAAGTCAGAGGTGAAAGCCCGGGGCTCAACCCCGGAATAGCCTTTGAGACTG
GATTGCTTGAATCCGGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGAT

The taxonomy csv:

ID,Domain,Phylum,Class,Order,Family,Genus,Species
AB000106.1.1343,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Sphingomonas sp.
HL182401.2.1459,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,unidentified
HG529990.1.1403,Bacteria,Bacteroidota,Bacteroidia,Cytophagales,Cyclobacteriaceae,Algoriphagus,sp. AK58
FW305570.1.1437,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Mitsuaria,unidentified
FW496046.1.1514,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Variovorax,unidentified

Kind regards,
Gaofeng

ctb · 2020-06-01T12:28:19Z

Ah-hah! Try adding --scaled=100 to the sourmash lca index! It defaults to 10,000 which is much too high for 16s. Thank you for posting the signatures -- that excluded one theory, and made it clear what the other one was :) I'm going to put some more error messages into lca index to warn about this! Also! Down the road, I think a major tuning parameter for you to consider is scaled. A scaled of 1 may be appropriate in this instance, in fact - keeping all of the k-mers... best, --titus

On Sun, May 31, 2020 at 11:44:59PM -0700, Gani wrote: Hi, I tried with `--scale 100` and having the same error: Commands: `for i in ./ref1/*fasta; do sourmash compute --scaled 100 $i -o $i.sig -k 21; done` `sourmash lca index -f ref1-taxonomy.csv ref1.lca.json ./ref1/*sig` Error: ``` == This is sourmash version 3.3.0. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == examining spreadsheet headers... ** assuming column 'ID' is identifiers in spreadsheet ** assuming column 'Domain' is superkingdom in spreadsheet 5 distinct identities in spreadsheet out of 5 rows. 5 distinct lineages in spreadsheet out of 5 rows. ERROR: no hash values found - are there any signatures? ``` Is the signature file look alright to you? My signatures are everything in one line, but apparently the signatures from the delmont-subsamples-sigs looks different. ``` [{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"./ref1/group_1.fasta","license":"CC0","signatures":[{"num":0,"ksize":21,"seed":42,"max_hash":184467440737095520,"mins":[5873293062244743,6708306589402593,25910513113457246,29049928113056352,43419558076788542,52116337125800102,76848212431295266,86161699444494465,89850708486462310,110143751296994050,120748837069271558,124814579368166673,131109090961182998,139986949992840010,152428826484152745,157313275701646212,183501513726388075],"md5sum":"4d5621999235d72f8f500d92d2865c29","molecule":"dna"}],"version":0.4}] ``` The sequences look like: ``` >AB000106.1.1343 GGAATCTGCCCTTGGGTTCGGAATAACGTCTGGAAACGGACGCTAATACCGGATGATGAC GTAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGAGG TAAAGGCTCACCAAGGCGACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGG ACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCG AAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTT ACCCGGGATGATAATGACAGTACCGGGAGAATAAGCCCCGGCTAACTCCGTGCCAGCAGC CGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGG CGGCGATTTAAGTCAGAGGTGAAAGCCCGGGGCTCAACCCCGGAATAGCCTTTGAGACTG GATTGCTTGAATCCGGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGAT ``` The taxonomy csv: ``` ID,Domain,Phylum,Class,Order,Family,Genus,Species AB000106.1.1343,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Sphingomonas sp. HL182401.2.1459,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,unidentified HG529990.1.1403,Bacteria,Bacteroidota,Bacteroidia,Cytophagales,Cyclobacteriaceae,Algoriphagus,sp. AK58 FW305570.1.1437,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Mitsuaria,unidentified FW496046.1.1514,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Variovorax,unidentified ``` Kind regards, Gaofeng -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: #1000 (comment)

-- C. Titus Brown, [email protected]

ctb · 2021-06-26T13:36:56Z

closing for now.

ctb mentioned this issue Jun 6, 2020

[MRG] Enable moltypes other than DNA in LCA databases #1013

Merged

8 tasks

This was referenced Aug 10, 2020

[MRG] Explore addition of a sketch command. #1159

Merged

potential next steps for the sourmash sketch command #1169

Open

ctb mentioned this issue Mar 29, 2021

How can I build a Sourmash LCA database for 16S rRNA gene sequence datasets #1421

Open

ctb mentioned this issue Apr 23, 2021

what about including marker gene "sketches" in sourmash? #670

Open

ctb closed this as completed Jun 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

classification with full length 16S gene #1000

classification with full length 16S gene #1000

ganiatgithub commented May 26, 2020

ganiatgithub commented May 26, 2020 •

edited

Loading

ganiatgithub commented May 26, 2020

ctb commented May 31, 2020

ganiatgithub commented Jun 1, 2020

ctb commented Jun 1, 2020 via email

ctb commented Jun 26, 2021

classification with full length 16S gene #1000

classification with full length 16S gene #1000

Comments

ganiatgithub commented May 26, 2020

ganiatgithub commented May 26, 2020 • edited Loading

ganiatgithub commented May 26, 2020

ctb commented May 31, 2020

ganiatgithub commented Jun 1, 2020

ctb commented Jun 1, 2020 via email

ctb commented Jun 26, 2021

ganiatgithub commented May 26, 2020 •

edited

Loading