provide an improved human-readable output & exploration utility for taxonomy #3535

Open
ctb opened this issue Feb 16, 2025 · 2 comments
Labels
plugin_todo Write a plugin for this!

ctb commented Feb 16, 2025

based in part on experiences in #3534, I think we could improve our human-readable output for inspecting taxonomy results. this is particularly important as we head towards more comprehensive examination of taxonomic membership.

a few specific things come to mind -

  • options to limit display to either total % explained (e.g. "I want to see what explains 99% of tax assignments, and no more", to avoid the tons of .01% matches), or total number of hits ("I only want to see n hits"), or only hits above a given threshold
  • hierarchical output organized by higher level ranks, e.g. "order hits by phylum level % explained, but then show me details for the top x% of species underneath that"
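
To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not taken from sourmash or any existing plugin: the function names `limit_hits` and `group_by_rank`, their parameters, and the sample lineages are all hypothetical, and it assumes each hit is a `(lineage, fraction)` pair with a semicolon-separated lineage string.

```python
from collections import defaultdict


def limit_hits(hits, max_cumulative=None, max_hits=None, min_fraction=None):
    """Sort hits by descending fraction and stop once any limit is reached.

    hits: iterable of (lineage, fraction) pairs (assumed input shape).
    """
    ordered = sorted(hits, key=lambda h: h[1], reverse=True)
    kept, total = [], 0.0
    for lineage, fraction in ordered:
        if max_hits is not None and len(kept) >= max_hits:
            break  # already showing n hits
        if min_fraction is not None and fraction < min_fraction:
            break  # everything from here on is below the cutoff
        if max_cumulative is not None and total >= max_cumulative:
            break  # the kept hits already explain enough
        kept.append((lineage, fraction))
        total += fraction
    return kept


def group_by_rank(hits, depth=2, top_share=0.75):
    """Group hits by the first `depth` ranks, ordered by group total.

    Within each group, keep only the hits needed to reach `top_share`
    of that group's total.
    """
    groups = defaultdict(list)
    for lineage, fraction in hits:
        prefix = ";".join(lineage.split(";")[:depth])
        groups[prefix].append((lineage, fraction))
    report = []
    for prefix, members in groups.items():
        total = sum(fraction for _, fraction in members)
        top = limit_hits(members, max_cumulative=top_share * total)
        report.append((prefix, total, top))
    report.sort(key=lambda g: g[1], reverse=True)
    return report


# Made-up example data, for illustration only.
hits = [
    ("d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;s__A", 0.25),
    ("d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;s__B", 0.125),
    ("d__Bacteria;p__Bacillota;c__Bacilli;s__C", 0.5),
    ("d__Bacteria;p__Bacillota;c__Bacilli;s__D", 0.0625),
    ("d__Bacteria;p__Bacillota;c__Bacilli;s__E", 0.0625),
]

for prefix, total, top in group_by_rank(hits):
    print(f"{total:.1%} {prefix}")
    for lineage, fraction in top:
        print(f"    {fraction:.1%} {lineage}")
```

The three limits in `limit_hits` are independent, so they can be combined (e.g. "at most 20 hits, and stop at 99% explained"). `depth=2` corresponds to phylum only if lineages start at the domain rank, which is an assumption about the input.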

this could definitely be a plugin!

see also https://github.com/ctb/2025-explore-sourmash-gather for a similarly-motivated script for exploring gather results w/o taxonomy.

ctb added the plugin_todo ("Write a plugin for this!") label on Feb 16, 2025

ctb commented Feb 16, 2025

with reference to #3537 - perhaps we could have this work entirely on the output of tax annotate, for speed reasons.
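
As a rough sketch of what working purely from an annotated CSV could look like: the code below sums a per-match fraction by lineage using only the standard library. The column names `lineage` and `f_unique_weighted` are assumptions about the annotated gather output, the function name `summarize_lineages` is hypothetical, and it assumes one query per file. Treating the unexplained remainder as "unknown" is likewise an assumption.

```python
import csv
import io
from collections import defaultdict


def summarize_lineages(csv_file, lineage_col="lineage",
                       fraction_col="f_unique_weighted"):
    """Sum the per-match fraction for each lineage in an annotated CSV.

    Column names are assumptions; adjust them to the actual file.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        lineage = row.get(lineage_col) or "unassigned"
        totals[lineage] += float(row[fraction_col])
    # Assumption: whatever is not explained by any match is "unknown".
    remainder = 1.0 - sum(totals.values())
    if remainder > 0:
        totals["unknown"] += remainder
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up rows standing in for a *.with-lineages.csv file.
example = io.StringIO(
    "name,f_unique_weighted,lineage\n"
    "genomeA,0.5,d__Bacteria;p__Bacillota\n"
    "genomeB,0.25,d__Bacteria;p__Pseudomonadota\n"
    "genomeC,0.125,d__Bacteria;p__Pseudomonadota\n"
)
summary = summarize_lineages(example)
for lineage, fraction in summary:
    print(f"{fraction:.1%} {lineage}")
```

Because this only reads the CSV, no lineage database has to be loaded at display time, which is the speed argument above.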


ctb commented Feb 22, 2025

see https://github.com/ctb/2025-atb-small-genomes/blob/main/round3/parse.py - gives output like so,

% parse.py SAMN07157241.entire.gather.with-lineages.csv -m 1
SAMN07157241 SAMN07157241
        50.0% Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapie...
        37.5% d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria (2 species)
        8.3% d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Stre...
        4.2% unknown

for multiple files, etc. Right now it's specialized for AllTheBacteria/AllTheBacteria#59 but it'd be easy to make it nicer. It's reasonably fast: about 5 minutes for 2,000 files containing 3.3 million fastmultigather results. Uses polars, of course ;).
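
For the display side, a line layout similar to the sample output above can be sketched with plain string formatting. This is not the code from parse.py: `format_row` and its `width` parameter are hypothetical, and the truncation width is an arbitrary choice.

```python
def format_row(fraction, lineage, width=60):
    """Format one report line: right-aligned percentage, then the lineage.

    Lineages longer than `width` characters are cut and end in "...".
    """
    if len(lineage) > width:
        lineage = lineage[: width - 3] + "..."
    return f"{fraction:8.1%} {lineage}"


# Made-up values, for illustration only.
rows = [
    (0.5, "d__Bacteria;p__Bacillota;c__Bacilli"),
    (0.375, "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria"),
    (0.125, "unknown"),
]
for fraction, lineage in rows:
    print(format_row(fraction, lineage))
```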
