tutorial: comprehensive taxonomic characterization of metagenome against eukaryotes + bacteria + archaea #3534

ctb · 2025-02-15T19:49:28Z

hackmd here: https://hackmd.io/vhpGc1uYRx-0tVGQUL8GPw?view

[toc]

sourmash taxonomy tutorial(s) - euks, genbank, euks+gtdb, etc.

:::info
Files on farm in ~ctbrown/scratch3/2025-sourmash-tax-tutorial/.
:::

:::danger
We recommend using k=51 when searching against eukaryotic genomes, as there seem to be many false positives at smaller ksizes.
:::

:::success
Note that the entire database used here is NCBI eukaryotes + GTDB bacteria + archaea. This might look like a problem, because the eukaryotic taxonomy is from NCBI, while you can use either the GTDB or NCBI taxonomies for the GTDB genomes. However, sourmash can happily use both types of taxonomy together, and can mix and match as needed, e.g. use the NCBI taxonomy for euks and the GTDB taxonomy for bacteria. This is because sourmash taxonomy is keyed on the Genbank identifier, not on the NCBI taxid.

Note that you do have to choose a single lineage for each Genbank identifier, of course. So you can't simultaneously use the NCBI and GTDB lineages for the same bacterial genome!
:::
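
To make the "one lineage per identifier" rule concrete, here is a minimal Python sketch that merges two lineage tables keyed on the identifier and refuses to accept two different lineages for the same one. The column names (`ident`, `superkingdom`, `species`), the helper names, and the accessions are assumptions made up for illustration; this is not how sourmash itself loads lineages.

```python
import csv
import io


def merge_lineages(*tables):
    """Merge lineage rows keyed on 'ident', allowing one lineage per identifier."""
    merged = {}
    for rows in tables:
        for row in rows:
            ident = row["ident"]
            # The same identifier may appear twice only if the lineage is identical.
            if ident in merged and merged[ident] != row:
                raise ValueError(f"conflicting lineages for {ident}")
            merged[ident] = row
    return merged


def read_rows(text):
    """Parse CSV text into a list of dicts (stand-in for reading a lineages file)."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical identifiers and lineages, for illustration only.
ncbi_euks = read_rows(
    "ident,superkingdom,species\n"
    "GCA_900000001.1,Eukaryota,Homo sapiens\n"
)
gtdb_proks = read_rows(
    "ident,superkingdom,species\n"
    "GCA_900000002.1,d__Bacteria,s__Escherichia coli\n"
)

combined = merge_lineages(ncbi_euks, gtdb_proks)
for ident, row in sorted(combined.items()):
    print(ident, row["superkingdom"], row["species"], sep="\t")
```

Because the merge is keyed on the identifier alone, NCBI-style and GTDB-style lineages can sit side by side; only a second, different lineage for the same identifier is rejected.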

Suppose we have a fastmultigather output file, SRR4420318.k51.csv, that was generated using the combined euk+bac+arc database.

The corresponding lineages file is here: /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb. (You can also use a CSV lineages file, but it is slower; you can generate a sqldb file from a CSV file using sourmash tax prepare.)

Comprehensive kingdom-level taxonomy

We can get a kingdom-level breakdown of the gather results like so:

sourmash tax metagenome \
    -g SRR4420318.k51.csv \
    -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb \
    -r superkingdom -F human

The resulting breakdown looks like this:

sample name    proportion   cANI   lineage
-----------    ----------   ----   -------
SRR4420318        88.9%     99.5%  Eukaryota
SRR4420318        10.6%     -      unclassified
SRR4420318         0.4%     90.6%  d__Bacteria
SRR4420318         0.0%     80.4%  d__Archaea

showing that the data is mostly eukaryotic, with some bacteria and very little archaea.

Species-level breakdown of eukaryotes

If you want a breakdown of eukaryotes by species, you can do:

sourmash tax metagenome \
    -g SRR4420318.k51.csv \
    -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb \
    -F human -r species | grep Eukaryota

which will give you:

SRR4420318        83.8%     99.5%  Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens
SRR4420318         3.4%     82.3%  Eukaryota;Annelida;Clitellata;Crassiclitellata;Megascolecidae;Amynthas;Amynthas corticis
SRR4420318         0.4%     87.9%  Eukaryota;Chordata;Actinopteri;Characiformes;Alestidae;Tricuspidalestes;Tricuspidalestes caeruleus
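
If you would rather filter in Python than with grep, the rows above can be split into fields. This is a rough sketch that assumes the whitespace-aligned layout shown in this tutorial (sample, proportion, cANI, lineage); the function name `parse_human_rows` is made up here, and the human-readable format is meant for eyeballing, so treat this as a convenience rather than a stable interface.

```python
def parse_human_rows(text):
    """Parse rows of the human-readable table into dicts; skip header and blank lines."""
    records = []
    for line in text.splitlines():
        # Split into at most 4 fields so species names with spaces stay intact.
        fields = line.strip().split(None, 3)
        if len(fields) != 4 or not fields[1].endswith("%"):
            continue
        sample, proportion, cani, lineage = fields
        records.append({
            "sample": sample,
            "percent": float(proportion.rstrip("%")),
            # Unclassified rows show "-" instead of a cANI value.
            "cANI": None if cani == "-" else float(cani.rstrip("%")),
            "lineage": lineage.split(";"),
        })
    return records


# The three species-level rows shown above.
output = """\
SRR4420318        83.8%     99.5%  Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens
SRR4420318         3.4%     82.3%  Eukaryota;Annelida;Clitellata;Crassiclitellata;Megascolecidae;Amynthas;Amynthas corticis
SRR4420318         0.4%     87.9%  Eukaryota;Chordata;Actinopteri;Characiformes;Alestidae;Tricuspidalestes;Tricuspidalestes caeruleus
"""

euks = [r for r in parse_human_rows(output) if r["lineage"][0] == "Eukaryota"]
for r in euks:
    print(f'{r["percent"]:5.1f}%  {r["lineage"][-1]}')
```

Matching on the first lineage element is a little stricter than grep, which would also match "Eukaryota" anywhere else on the line.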

Enabling programmatic investigation

Text output is all well and good, but sometimes you just need a CSV file, amirite? You can annotate any input CSV file with semicolon-separated lineages using sourmash tax annotate:

sourmash tax annotate \
    -g SRR4420318.k51.csv \
    -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb

This will output a file SRR4420318.k51.with-lineages.csv with an added column lineage.
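
Once you have the annotated CSV, summarizing it takes only a few lines of Python. The sketch below sums a weight column by the first element of the `lineage` column. The `lineage` column is the one added by tax annotate; the weight column name `f_unique_weighted`, the `match_name` column, and the sample rows are assumptions for illustration, so check the header of your own file before relying on them.

```python
import csv
import io
from collections import defaultdict


def sum_by_top_rank(handle, weight_col="f_unique_weighted"):
    """Sum weight_col per top-level lineage name from an annotated gather CSV."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        lineage = row["lineage"]
        # Rows without a lineage are grouped under "unclassified".
        top = lineage.split(";")[0] if lineage else "unclassified"
        totals[top] += float(row[weight_col])
    return dict(totals)


# Hypothetical rows standing in for SRR4420318.k51.with-lineages.csv.
sample = io.StringIO(
    "match_name,f_unique_weighted,lineage\n"
    "genome A,0.50,Eukaryota;Chordata;Mammalia\n"
    "genome B,0.25,Eukaryota;Annelida;Clitellata\n"
    "genome C,0.125,d__Bacteria;p__Pseudomonadota\n"
)

totals = sum_by_top_rank(sample)
for name, fraction in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{fraction:.3f}\t{name}")
```

For a real file, pass an open file handle instead of the in-memory sample, e.g. `open("SRR4420318.k51.with-lineages.csv", newline="")`.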

ctb added the tutorial links to tutorials label Feb 15, 2025

This was referenced Feb 16, 2025

provide an improved human-readable output & exploration utility for taxonomy #3535

Open

does tax metagenome work with output of tax annotate? #3537

Open
