classification with full length 16S gene #1000
Hi, I've been reading the tutorials on LCA. I have a bunch of full-length 16S reads (~1.5 kb), and I want to use the SILVA 138 NR99 database to classify them. Are the following steps the best way (basically treating 16S genes as "genomes")?
There are 374,222 sequences in SILVA release 138. Is there a way to avoid generating an individual signature for each 16S gene in steps 1 and 5? I'd love to hear from you guys. Cheers,
Hi again, I did a small trial with SILVA 138 sequences and am at the step of generating an LCA database for full-length 16S sequences. I used this command to generate signatures:
The generated sig file looks like this (everything is on one line):
Then I tried to build the LCA database with:
There was an error regarding the signatures:
Is this method suitable for 16S gene classification, or did I do something wrong? Many thanks!
Huh, that's interesting! Two thoughts. First, try removing the .sig files and regenerating the signatures with:
Second, we have had some interest in this in the past (see #548), and a postdoc in the lab did some work on it and concluded that it would work. You may simply need to use a smaller scaled value. Apologies for the delay in response; happy to chat more, but let's see if we can get this error fixed.
Ah-hah! Try adding --scaled=100 to the sourmash lca index command! It defaults
to 10,000, which is much too high for 16S.
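
To see why the default is a problem here, a minimal sketch in plain Python (no sourmash import), assuming a scaled signature keeps only hash values up to roughly 2**64 / scaled. The hash values are the `mins` from the `group_1.fasta` signature quoted in this thread; `max_hash_for` and `downsample` are hypothetical helper names, not sourmash functions.

```python
# Hash values ("mins") copied from the scaled=100 signature posted in this thread.
MINS = [
    5873293062244743, 6708306589402593, 25910513113457246,
    29049928113056352, 43419558076788542, 52116337125800102,
    76848212431295266, 86161699444494465, 89850708486462310,
    110143751296994050, 120748837069271558, 124814579368166673,
    131109090961182998, 139986949992840010, 152428826484152745,
    157313275701646212, 183501513726388075,
]


def max_hash_for(scaled):
    # Assumption: hashes are 64-bit and about 1/scaled of that range is kept.
    return 2**64 // scaled


def downsample(mins, scaled):
    # Keep only the hashes that fall at or below the cutoff for this scaled value.
    cutoff = max_hash_for(scaled)
    return [h for h in mins if h <= cutoff]


for scaled in (100, 1000, 10000):
    print(f"scaled={scaled}: {len(downsample(MINS, scaled))} of {len(MINS)} hashes kept")
```

All 17 values sit above the scaled=10,000 cutoff, so nothing is left after downsampling to it, which is consistent with the "no hash values found" error.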
Thank you for posting the signatures -- that excluded one theory,
and made it clear what the other one was :)
I'm going to put some more error messages into lca index to warn about this!
Also! Down the road, I think a major tuning parameter for you to consider
is scaled. A scaled of 1 may be appropriate in this instance, in fact -
keeping all of the k-mers...
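
As a rough back-of-the-envelope sketch for picking scaled: a sequence of length L has L - k + 1 k-mers, and on average about 1/scaled of them are retained. The 1,500 bp length is the approximate full-length 16S figure from earlier in the thread, `expected_hashes` is a hypothetical helper, and the result is an expectation that ignores repeated k-mers, not an exact count.

```python
def expected_hashes(seq_len, ksize, scaled):
    # Number of k-mers in a linear sequence, never negative.
    n_kmers = max(seq_len - ksize + 1, 0)
    # On average, about 1/scaled of the k-mer hashes fall under the cutoff.
    return n_kmers / scaled


for scaled in (10000, 1000, 100, 10, 1):
    estimate = expected_hashes(1500, 21, scaled)
    print(f"scaled={scaled:>5}: ~{estimate:.1f} hashes per 16S gene")
```

At scaled=10,000 the expectation is well under one hash per gene, while scaled=1 keeps every k-mer.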
best,
--titus
On Sun, May 31, 2020 at 11:44:59PM -0700, Gani wrote:
Hi,
I tried with `--scaled 100` and got the same error:
Commands:
`for i in ./ref1/*fasta; do sourmash compute --scaled 100 $i -o $i.sig -k 21; done`
`sourmash lca index -f ref1-taxonomy.csv ref1.lca.json ./ref1/*sig`
Error:
```
== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
examining spreadsheet headers...
** assuming column 'ID' is identifiers in spreadsheet
** assuming column 'Domain' is superkingdom in spreadsheet
5 distinct identities in spreadsheet out of 5 rows.
5 distinct lineages in spreadsheet out of 5 rows.
ERROR: no hash values found - are there any signatures?
```
Does the signature file look alright to you? My signatures are all on one line, but the signatures from delmont-subsamples-sigs look different.
```
[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"./ref1/group_1.fasta","license":"CC0","signatures":[{"num":0,"ksize":21,"seed":42,"max_hash":184467440737095520,"mins":[5873293062244743,6708306589402593,25910513113457246,29049928113056352,43419558076788542,52116337125800102,76848212431295266,86161699444494465,89850708486462310,110143751296994050,120748837069271558,124814579368166673,131109090961182998,139986949992840010,152428826484152745,157313275701646212,183501513726388075],"md5sum":"4d5621999235d72f8f500d92d2865c29","molecule":"dna"}],"version":0.4}]
```
The sequences look like:
```
>AB000106.1.1343
GGAATCTGCCCTTGGGTTCGGAATAACGTCTGGAAACGGACGCTAATACCGGATGATGAC
GTAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGAGG
TAAAGGCTCACCAAGGCGACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGG
ACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCG
AAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTT
ACCCGGGATGATAATGACAGTACCGGGAGAATAAGCCCCGGCTAACTCCGTGCCAGCAGC
CGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGG
CGGCGATTTAAGTCAGAGGTGAAAGCCCGGGGCTCAACCCCGGAATAGCCTTTGAGACTG
GATTGCTTGAATCCGGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGAT
```
The taxonomy csv:
```
ID,Domain,Phylum,Class,Order,Family,Genus,Species
AB000106.1.1343,Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Sphingobium,Sphingomonas sp.
HL182401.2.1459,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,unidentified
HG529990.1.1403,Bacteria,Bacteroidota,Bacteroidia,Cytophagales,Cyclobacteriaceae,Algoriphagus,sp. AK58
FW305570.1.1437,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Mitsuaria,unidentified
FW496046.1.1514,Bacteria,Proteobacteria,Gammaproteobacteria,Burkholderiales,Comamonadaceae,Variovorax,unidentified
```
Kind regards,
Gaofeng
--
C. Titus Brown
Closing for now.
Hi,
I'd like to develop a workflow that uses sourmash to classify bacteria from full-length 16S gene sequences generated on the nanopore sequencing platform. Has any work already been done in this area?
Kind regards,
Gaofeng