sourmash taxonomy tutorial(s) - euks, genbank, euks+gtdb, etc.
hackmd here: https://hackmd.io/vhpGc1uYRx-0tVGQUL8GPw?view
[toc]
:::info
Files on farm in `~ctbrown/scratch3/2025-sourmash-tax-tutorial/`.
:::
:::danger
We recommend using k=51 when searching against eukaryotic genomes, as there seem to be many false positives at smaller ksizes.
:::
:::success
Note that the `entire` database used here is NCBI eukaryotes + GTDB bacteria + archaea. This could be a problem because the eukaryotic taxonomy is from NCBI, while you can use either the GTDB or NCBI taxonomies for the GTDB genomes. However, sourmash can happily use both types of taxonomy together, and can mix and match as needed, e.g. use the NCBI taxonomy for euks and the GTDB taxonomy for bacteria. This is because sourmash taxonomy is keyed on the Genbank identifier, not on the NCBI taxid.

Note that you do have to choose a single lineage for each Genbank identifier, of course. So you can't simultaneously use the NCBI and GTDB lineages for the same bacterial genome!
:::
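To make the "keyed on the Genbank identifier" point concrete, here is a minimal Python sketch of how a mixed lineage table could be assembled. It is an illustration only, not sourmash's own code: the identifiers, the lineage strings, and the `build_mixed_lineages` helper are all hypothetical placeholders.

```python
# Hypothetical lineage tables, each keyed on a Genbank identifier.
# The accessions and lineage strings are made-up placeholders.
ncbi_lineages = {
    "GCA_000000001.1": "Eukaryota;Ascomycota",       # a eukaryote (NCBI only)
    "GCA_000000002.1": "Bacteria;Pseudomonadota",    # a bacterium, NCBI naming
}
gtdb_lineages = {
    "GCA_000000002.1": "d__Bacteria;p__Pseudomonadota",  # same bacterium, GTDB naming
}


def build_mixed_lineages(ncbi, gtdb):
    """Return one lineage per identifier, preferring GTDB where it exists."""
    mixed = dict(ncbi)   # start from the NCBI lineages
    mixed.update(gtdb)   # GTDB entries overwrite NCBI ones for the same identifier
    return mixed


mixed = build_mixed_lineages(ncbi_lineages, gtdb_lineages)
for ident, lineage in sorted(mixed.items()):
    print(ident, lineage)
```

Because the table is a mapping from identifier to lineage, each identifier ends up with exactly one lineage, which is the constraint described above.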
Suppose we have a fastmultigather output file, `SRR4420318.k51.csv`, that was generated using the combined euk+bac+arc database.

The corresponding lineages file is here: `/group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb`. (You can also use a CSV lineages file, but it is slower; you can generate a sqldb file from a CSV file using `sourmash tax prepare`.)

Comprehensive kingdom-level taxonomy
We can get a kingdom level breakdown of the gather results like so:
The resulting breakdown looks like this:
showing that the data is mostly eukaryotic, with some bacteria and very little archaea.
Species-level breakdown of eukaryotes
If you want a breakdown of eukaryotes by species, you can do:
which will give you:
Enabling programmatic investigation
Text output is all well and good, but sometimes you just need a CSV file, amirite? You can annotate any input CSV file with semicolon-separated lineages using `sourmash tax annotate`:

This will output a file `SRR4420318.k51.with-lineages.csv` with an added column `lineage`.
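Once the `lineage` column is there, the annotated CSV can be summarized with a few lines of Python. The sketch below is a rough illustration under stated assumptions: it assumes the per-match fraction lives in a column named `f_unique_weighted` and that the first element of the semicolon-separated lineage is the top-level rank. The `summarize_by_rank` helper, the `match_name` column, and the example rows are made up for demonstration.

```python
import csv
import io
from collections import defaultdict


def summarize_by_rank(csv_file, rank_index=0, value_column="f_unique_weighted"):
    """Sum value_column per lineage element at position rank_index."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        lineage = row.get("lineage") or ""
        ranks = lineage.split(";") if lineage else []
        # Rows with no lineage (or a too-short one) are grouped as "unassigned".
        if rank_index < len(ranks) and ranks[rank_index]:
            name = ranks[rank_index]
        else:
            name = "unassigned"
        totals[name] += float(row[value_column])
    return dict(totals)


# Tiny made-up table standing in for an annotated CSV (values are not real results).
example_csv = io.StringIO(
    "match_name,f_unique_weighted,lineage\n"
    "genomeA,0.50,Eukaryota;Ascomycota\n"
    "genomeB,0.25,Eukaryota;Chordata\n"
    "genomeC,0.125,d__Bacteria;p__Bacillota\n"
    "genomeD,0.125,\n"
)

totals = summarize_by_rank(example_csv)
for name, fraction in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}\t{fraction:.3f}")
```

To run it on real data, pass an open file handle instead of the `io.StringIO` object, e.g. `open("SRR4420318.k51.with-lineages.csv", newline="")`, and adjust `value_column` if your file uses a different column name.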