tutorial: comprehensive taxonomic characterization of metagenome against eukaryotes + bacteria + archaea #3534

Open
ctb opened this issue Feb 15, 2025 · 0 comments
Labels
tutorial links to tutorials

ctb commented Feb 15, 2025

hackmd here: https://hackmd.io/vhpGc1uYRx-0tVGQUL8GPw?view

[toc]

sourmash taxonomy tutorial(s) - euks, genbank, euks+gtdb, etc.

::: info
Files on farm in ~ctbrown/scratch3/2025-sourmash-tax-tutorial/.
:::

:::danger
We recommend using k=51 when searching against eukaryotic genomes, as there seem to be many false positives at smaller ksizes.
:::

:::success
Note that the entire database used here is NCBI eukaryotes + GTDB bacteria + archaea. This could be a problem because the eukaryotic taxonomy comes from NCBI, while the GTDB genomes can use either the GTDB or the NCBI taxonomy. However, sourmash can happily use both types of taxonomy together and mix and match as needed, e.g. the NCBI taxonomy for euks and the GTDB taxonomy for bacteria. This works because sourmash taxonomy is keyed on the Genbank identifier, not on the NCBI taxid.

Note that you do have to choose a single lineage for each Genbank identifier, of course. So you can't simultaneously use the NCBI and GTDB lineages for the same bacterial genome!
:::
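
To make the "keyed on the Genbank identifier" point concrete, here is a minimal Python sketch. It is not sourmash's actual implementation, just an illustration of the idea: lineage tables from different taxonomies can be merged as long as no identifier ends up with two different lineages. The function name `merge_lineages` and all identifiers below are hypothetical, and the lineage strings are illustrative.

```python
def merge_lineages(*tables):
    """Merge identifier -> lineage tables, refusing conflicting lineages."""
    merged = {}
    for table in tables:
        for ident, lineage in table.items():
            if ident in merged and merged[ident] != lineage:
                # One identifier can only carry one lineage.
                raise ValueError(f"{ident} has two different lineages")
            merged[ident] = lineage
    return merged


# Hypothetical identifiers and illustrative lineages (not real database rows).
ncbi_euks = {
    "GCA_EXAMPLE_EUK": "Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens",
}
gtdb_bacteria = {
    "GCA_EXAMPLE_BAC": "d__Bacteria;p__Example;c__Example;o__Example;f__Example;g__Example;s__Example sp.",
}
ncbi_bacteria = {
    "GCA_EXAMPLE_BAC": "Bacteria;Example;Example;Example;Example;Example;Example sp.",
}

# Mixing NCBI lineages for euks with GTDB lineages for bacteria is fine,
# because the two tables share no identifiers.
combined = merge_lineages(ncbi_euks, gtdb_bacteria)
print(len(combined), "identifiers with exactly one lineage each")

# Using both GTDB and NCBI lineages for the same genome is a conflict.
try:
    merge_lineages(gtdb_bacteria, ncbi_bacteria)
except ValueError as exc:
    print("conflict:", exc)
```

The second call fails on purpose: it is the "can't use both lineages for the same bacterial genome" case described above.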

Suppose we have a fastmultigather output file, SRR4420318.k51.csv, that was generated using the combined euk+bac+arc database.

The corresponding lineages file is here: /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb. (You can also use a CSV lineages file, but it is slower; you can generate a sqldb file from a CSV file using sourmash tax prepare.)

Comprehensive kingdom-level taxonomy

We can get a kingdom-level breakdown of the gather results like so:

sourmash tax metagenome \
    -g SRR4420318.k51.csv \
    -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb \
    -r superkingdom -F human

The resulting breakdown looks like this:

sample name    proportion   cANI   lineage
-----------    ----------   ----   -------
SRR4420318        88.9%     99.5%  Eukaryota
SRR4420318        10.6%     -      unclassified
SRR4420318         0.4%     90.6%  d__Bacteria
SRR4420318         0.0%     80.4%  d__Archaea

showing that the data is mostly eukaryotic, with some bacteria and very little archaea.

Species-level breakdown of eukaryotes

If you want a breakdown of eukaryotes by species, you can do:

sourmash tax metagenome \
    -g SRR4420318.k51.csv \
    -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb \
    -F human -r species | grep Eukaryota

which will give you:

SRR4420318        83.8%     99.5%  Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens
SRR4420318         3.4%     82.3%  Eukaryota;Annelida;Clitellata;Crassiclitellata;Megascolecidae;Amynthas;Amynthas corticis
SRR4420318         0.4%     87.9%  Eukaryota;Chordata;Actinopteri;Characiformes;Alestidae;Tricuspidalestes;Tricuspidalestes caeruleus

Enabling programmatic investigation

Text output is all well and good, but sometimes you just need a CSV file, amirite? You can annotate any input CSV file with semicolon-separated lineages using sourmash tax annotate:

sourmash tax annotate \
    -g SRR4420318.k51.csv \
    -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb 

This will output a file SRR4420318.k51.with-lineages.csv with an added column lineage.
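
As a rough sketch of what you might then do with the annotated file, the Python below sums a weight column by the first element of each semicolon-separated lineage. Only the `lineage` column name comes from the text above; the `f_unique_weighted` column is an assumption (it is what sourmash gather output uses, so check your own CSV header), and the function name `sum_by_top_rank` and the toy rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def sum_by_top_rank(csv_file, weight_column="f_unique_weighted"):
    """Sum weight_column per top-level lineage entry (e.g. Eukaryota)."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        lineage = row["lineage"]
        # Treat an empty lineage as unclassified (a defensive assumption).
        top = lineage.split(";")[0] if lineage else "unclassified"
        totals[top] += float(row[weight_column])
    return dict(totals)


# Toy rows for illustration only; these are not real results.
toy_csv = io.StringIO(
    "name,f_unique_weighted,lineage\n"
    "genome_a,0.5,Eukaryota;Chordata\n"
    "genome_b,0.25,Eukaryota;Annelida\n"
    "genome_c,0.125,d__Bacteria;p__Example\n"
)

totals = sum_by_top_rank(toy_csv)
for name, fraction in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}\t{fraction:.3f}")
```

To run it on the real file, open it with `open("SRR4420318.k51.with-lineages.csv", newline="")` and pass the file object in place of `toy_csv`, adjusting `weight_column` if your header differs.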
