ww-poreCoV extension #275

Open
hoelzer opened this issue Jul 31, 2024 · 19 comments
@hoelzer
Collaborator

hoelzer commented Jul 31, 2024

I suggest using the poreCov pipeline as the backend for SARS-CoV-2 wastewater lineage deconvolution from nanopore long reads. You already added freyja (#274, #270), which is great as the current community standard.

However, we are also interested in detecting new stuff, aka "cryptic lineages" or novel mutation profiles.

To do this, I would like to test/implement two recent approaches:

  1. CONCOMPRA
  2. Floria

This way, we would get known lineage abundances from freyja plus potential new lineages from one or both of the other tools.

Finally, we could also write a little ww-poreCoV extension paper ;)

@hoelzer hoelzer added the enhancement New feature or request label Jul 31, 2024
@hoelzer
Collaborator Author

hoelzer commented Aug 14, 2024

Another interesting method might be VirPool: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05100-3

They already illustrated the advantage that longer amplicons bring, with their method in particular.

is able to use the entire length of reads instead of just the most informative positions, and can also capture haplotype dependencies within a single read. A crucial property of our model is its ability to capture long-range dependencies within reads, which is particularly relevant when coupled with use of long amplicons and nanopore sequencing.

They also tested the method on real data by sequencing a mixture of eight clinical samples using long amplicons (2kb).

(sidenote, attention was brought up and information was shared by Victor! thx!)

The question is, as usual, how up-to-date is the tool?

https://github.com/fmfi-compbio/virpool

They also provide scripts to create your own profiles: https://github.com/fmfi-compbio/virpool?tab=readme-ov-file#creating-a-custom-variant-profile

Maybe covsonar could also create these profiles easily. Or Ashkan's sc2mfc tool.

@replikation
Owner

Yep, it's completely fine to extend poreCov to wastewater surveillance now. We just need to make sure the normal "user experience" does not get convoluted.

@MarieLataretu
Collaborator

MarieLataretu commented Aug 15, 2024

ad Floria:

You would use the VCF + BAM from ARTIC, right? Just to keep in mind: mixed indels might be tricky.
I saw an overlapping x nt deletion and y nt deletion, and neither of them was called by medaka, and/or a frameshift-introducing indel was called.

edit: corrected tool name; Florida would also be a fun name
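
A minimal sketch of how such overlapping deletions could be flagged before phasing, assuming the deletions have already been pulled out of the VCF as (position, length) pairs. The function name and the example coordinates are hypothetical; nothing here is part of poreCov, ARTIC, medaka, or Floria.

```python
from typing import List, Tuple

# (1-based reference start position, deleted length in nt)
Deletion = Tuple[int, int]


def find_overlapping_deletions(
    deletions: List[Deletion],
) -> List[Tuple[Deletion, Deletion]]:
    """Return all pairs of deletions whose reference intervals overlap."""
    ordered = sorted(deletions)
    overlaps = []
    for i, (start_a, len_a) in enumerate(ordered):
        end_a = start_a + len_a - 1  # inclusive end of deletion A
        for start_b, len_b in ordered[i + 1:]:
            if start_b > end_a:
                # Sorted by start, so no later deletion can overlap A.
                break
            overlaps.append(((start_a, len_a), (start_b, len_b)))
    return overlaps


if __name__ == "__main__":
    # Hypothetical calls: a 9 nt and a 3 nt deletion that overlap, plus one apart.
    print(find_overlapping_deletions([(100, 9), (105, 3), (200, 6)]))
```

Sites reported this way could be excluded or inspected by hand before they are passed on to a haplotype caller.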

@hoelzer
Collaborator Author

hoelzer commented Aug 15, 2024

ad Florida:

You would use the VCF + BAM from ARTIC, right? Just to keep in mind: mixed indels might be tricky. I saw an overlapping x nt deletion and y nt deletion, and neither of them was called by medaka, and/or a frameshift-introducing indel was called.

Yes, I would like to use the output that poreCov produces anyway, to change as little as possible.

However, good point. I would live with such issues for now. Deconvoluting lineages from wastewater is the wild west anyway :) but of course, it is important to keep such situations in mind.

Ps: Floria ;)

@replikation
Owner

Not sure if it is possible to implement a subcommand or a second "main.nf" solely for the wastewater part? (thinking of samtools-style subcommands, for instance)

@MarieLataretu
Collaborator

I think subcommands are rather unusual. Afaik, this is mainly handled by a parameter.
E.g. viralrecon has a platform parameter for either Illumina or nanopore:
https://github.com/nf-core/viralrecon/blob/3731dd3a32a67a2648ea22c2bd980c224abdaee2/main.nf#L62-L76

Also, a second main.nf could cause trouble when executing directly from GitHub via nextflow run replikation/poreCov ... (or it would need some extra configuration)

@hoelzer
Collaborator Author

hoelzer commented Aug 21, 2024

Started working on this in a branch ww-porecov. The first step was a container for CONCOMPRA, which was already a bit of a pain, but I finally made it:

rkimf1/concompra:v0.0.1--f6c273d

@hoelzer
Collaborator Author

hoelzer commented Oct 24, 2024

The authors of CONCOMPRA provide a Docker image now: willemstock/concompra:version0.0.2

However, we need to see if that works w/ nextflow

@hoelzer
Collaborator Author

hoelzer commented Nov 14, 2024

This might be even more interesting to add as a process to ww-poreCov:

One big question is probably: how do they handle the reference used for lineage assignments at the end? And: can they detect something new/cryptic?

@hoelzer
Collaborator Author

hoelzer commented Dec 16, 2024

Looking into HERCULES

  • they have a pre-built Docker container, but we might want our own to work w/ nextflow (it's maybe a bit cumbersome to use otherwise)

  • they provide an option for a custom reference, based on aligned spike genes:

To update the set of references that HERCULES uses, you need a fasta file with all the Spike-genes of the reference sequences aligned. The names of the sequences inside the fasta file must have the following structure IDXX_LineageY where IDXX is the unique identifier for the sequence and the LineageY is the lineage assigned to the sequence, it can be pangolin lineage or any other nomenclature system (e.g. nextclade clade ID, WHO nomenclature, etc). The fasta file must be stored in a folder that must be mounted inside HERCULES with the following flag. -v Path/to/Folder:/Reference

This might be a good option to build an up-to-date reference based on "representative spike sequences" per lineage (https://github.com/rki-mf1/sc2-mutation-frequency-calculator)
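
A minimal sketch of how an aligned spike FASTA could be relabelled to the IDXX_LineageY header structure quoted above. The function names, the example IDs, and the lineage mapping are hypothetical. The sketch also assumes that lineage labels contain no underscore, so a header can be split at its last underscore; that convention is an assumption made here and is not taken from the HERCULES documentation.

```python
from typing import Dict, Iterable, Iterator, Tuple


def make_header(seq_id: str, lineage: str) -> str:
    """Build an 'IDXX_LineageY' style name from an ID and a lineage label."""
    if not seq_id or not lineage:
        raise ValueError("both the sequence ID and the lineage must be non-empty")
    if "_" in lineage:
        # Assumption of this sketch: the last underscore separates ID and lineage.
        raise ValueError(f"lineage label must not contain '_': {lineage!r}")
    return f"{seq_id}_{lineage}"


def split_header(header: str) -> Tuple[str, str]:
    """Split an 'IDXX_LineageY' name at its last underscore."""
    seq_id, sep, lineage = header.lstrip(">").rpartition("_")
    if not sep or not seq_id or not lineage:
        raise ValueError(f"header does not look like ID_Lineage: {header!r}")
    return seq_id, lineage


def relabel_fasta(lines: Iterable[str], lineage_by_id: Dict[str, str]) -> Iterator[str]:
    """Rewrite FASTA header lines; sequence lines are passed through unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id = line[1:].split()[0]  # drop any free-text description
            yield ">" + make_header(seq_id, lineage_by_id[seq_id])
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical records and lineage assignments, for illustration only.
    fasta = [">seq1 some description", "ATG-C", ">EPI_ISL_1", "ATGGC"]
    lineages = {"seq1": "BA.2.86", "EPI_ISL_1": "XBB.1.5"}
    print("\n".join(relabel_fasta(fasta, lineages)))
```

The lineage table could come from any source that maps representative sequences to lineages; the sketch only handles the renaming and leaves the alignment itself untouched.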

@hoelzer
Collaborator Author

hoelzer commented Dec 16, 2024

re-tagged the provided image to mhoelzer/hercules:0.0.0 bc the provided name (ghcr.io/garcia-nacho/hercules) crashes some process in poreCov

@hoelzer
Collaborator Author

hoelzer commented Dec 16, 2024

Unfortunately, the container is a pain to use in Nextflow. Here is the Dockerfile and how the tool is then executed (a bash script):

https://github.com/garcia-nacho/HERCULES/blob/master/Dockerfile#L76C19-L76C57

We probably need to rebuild the container first. I tried to just docker build the original Dockerfile, but this fails early (on my MacBook).

@hoelzer
Collaborator Author

hoelzer commented Jan 10, 2025

garcia-nacho/HERCULES#2 (comment)

What I would recommend you is to run HERCULES several times, one per amplicon, and to compare/integrate the results. The larger the amplicons the better (i.e., Midnight and ViralAmp protocols)

Hm okay.. similar issue as with CONCOMPRA
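
A minimal sketch of how reads could be binned per amplicon before such per-amplicon runs, assuming read alignment coordinates and amplicon coordinates are available as half-open reference intervals. The function name, the amplicon names and coordinates, and the 0.8 threshold are hypothetical; this is not how HERCULES or CONCOMPRA handle amplicons internally.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]  # half-open reference interval: [start, end)


def assign_amplicon(
    read_start: int,
    read_end: int,
    amplicons: Dict[str, Interval],
    min_fraction: float = 0.8,
) -> Optional[str]:
    """Return the amplicon covering most of the read, or None if ambiguous.

    A read is assigned only if at least `min_fraction` of its aligned
    length lies inside the best-matching amplicon.
    """
    read_len = read_end - read_start
    if read_len <= 0:
        raise ValueError("read_end must be greater than read_start")
    best_name, best_overlap = None, 0
    for name, (amp_start, amp_end) in amplicons.items():
        overlap = min(read_end, amp_end) - max(read_start, amp_start)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    if best_overlap / read_len >= min_fraction:
        return best_name
    return None


if __name__ == "__main__":
    # Hypothetical scheme with two overlapping ~1.2 kb amplicons.
    scheme = {"amp_1": (0, 1200), "amp_2": (1100, 2300)}
    print(assign_amplicon(20, 1180, scheme))    # falls inside amp_1
    print(assign_amplicon(600, 1800, scheme))   # spans both, left unassigned
```

Reads that span two amplicons are left unassigned in this sketch, so each per-amplicon run would only see reads that clearly belong to it.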

@hoelzer
Collaborator Author

hoelzer commented Jan 24, 2025

Is this maybe interesting for our purpose of detecting something new?

https://github.com/zhuangx15/ICAvar
https://www.medrxiv.org/content/10.1101/2024.04.18.24306052v1.full

@hoelzer
Collaborator Author

hoelzer commented Feb 13, 2025

@hoelzer
Collaborator Author

hoelzer commented Feb 19, 2025

Another package from colleagues in France (Hughes has contact), in which they reimplemented wastewater tools and made them faster.

Also

... indeed the haplotype callers should do the task that you need. If you want to make full use of the long reads, you will need to start from the BAM and count the mutation cooccurrences. The tool from El Hacene DJAOUT does that: https://pypi.org/project/VaRaPS/1.0.0
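
A minimal sketch of the co-occurrence counting idea, assuming the mutations carried by each read have already been extracted from the BAM (for example with a BAM parsing library, not shown here). The function name and the mutation labels are hypothetical placeholders; this is not the VaRaPS implementation.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Tuple


def count_cooccurrences(reads: Iterable[Iterable[str]]) -> "Counter[Tuple[str, str]]":
    """Count how often each pair of mutations is seen on the same read.

    `reads` yields, per read, the mutations observed on that read.
    Pairs are stored in sorted order so (a, b) and (b, a) share one key.
    """
    pair_counts: Counter = Counter()
    for mutations in reads:
        # set() guards against a mutation being listed twice for one read.
        for pair in combinations(sorted(set(mutations)), 2):
            pair_counts[pair] += 1
    return pair_counts


if __name__ == "__main__":
    # Hypothetical per-read mutation calls, for illustration only.
    reads = [
        ["A100G", "C250T"],
        ["A100G", "C250T", "G900A"],
        ["G900A"],
    ]
    for pair, n in sorted(count_cooccurrences(reads).items()):
        print(pair, n)
```

Pairs that are repeatedly seen together on single reads, but that do not match any known lineage profile, would be candidates for a closer look.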

@hoelzer
Collaborator Author

hoelzer commented Feb 19, 2025

Another interesting method might be VirPool: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05100-3

They already illustrated the advantage that longer amplicons bring, with their method in particular.

is able to use the entire length of reads instead of just the most informative positions, and can also capture haplotype dependencies within a single read. A crucial property of our model is its ability to capture long-range dependencies within reads, which is particularly relevant when coupled with use of long amplicons and nanopore sequencing.

They also tested the method on real data by sequencing a mixture of eight clinical samples using long amplicons (2kb).

(sidenote, attention was brought up and information was shared by Victor! thx!)

The question is, as usual, how up-to-date is the tool?

https://github.com/fmfi-compbio/virpool

They also provide scripts to create your own profiles: https://github.com/fmfi-compbio/virpool?tab=readme-ov-file#creating-a-custom-variant-profile

Maybe covsonar could also create these profiles easily. Or Ashkan's sc2mfc tool.

Virpool might be promising...

you have the choice between: Virpool (cooccurrence), Freyja (frequencies), and LCS (counts)

And the tool in the previous post has a faster reimplementation of virpool.

@hoelzer
Collaborator Author

hoelzer commented Feb 20, 2025

Thomas is testing: https://github.com/bluenote-1577/devider

@hoelzer
Collaborator Author

hoelzer commented Feb 25, 2025

That also looks promising, VILOCA:

https://academic.oup.com/nargab/article/6/4/lqae152/7912062
