Nextflow needs to keep track of the Wave containers it puts in our ECR #5801

Open
stevekm opened this issue Feb 18, 2025 · 3 comments

@stevekm
Contributor

stevekm commented Feb 18, 2025

We can currently use Wave's built-in features to deposit a copy of the final built Wave container into our private ECR:

https://www.nextflow.io/docs/latest/reference/config.html#wave

wave.build.repository = "1234567890.dkr.ecr.us-east-1.amazonaws.com/wave/build"

So when Nextflow runs with Wave + Fusion enabled, if the original container for a Nextflow process was something like this:

container = "1234567890.dkr.ecr.us-east-1.amazonaws.com/my-fastqc-custom:0.1"

and Wave then updates the container for use with Fusion, the resulting container URI recorded in the workflow execution is this:

wave.seqera.io/wt/abcd09876/my-fastqc-custom:0.1

However, our ECR's wave/build repo ends up with containers that look like this:

1234567890.dkr.ecr.us-east-1.amazonaws.com/wave/build:05a167792ff1748a
1234567890.dkr.ecr.us-east-1.amazonaws.com/wave/build:85f1471232f837f
1234567890.dkr.ecr.us-east-1.amazonaws.com/wave/build:376ab7922461748c
...

We end up with a bunch of containers pushed into our ECR with no way to identify them. And we have no way to know what the origin of these containers was.

To meet provenance tracking requirements, we need to be able to determine which original public Wave container was used when each of these containers was deposited in our ECR.

A simple solution for this would be for Nextflow to save the Wave "build logs" for the Wave containers as they are being generated. Alternatively, and perhaps more easily, Nextflow could just keep a log of the original Wave container URI and its build_id when it's using these Wave containers. Right now it does not seem like either of these things is happening.

Based on the current Wave API docs listed here https://docs.seqera.io/wave/api#get-v1alpha1buildsbuildidlogs it seems like it's impossible to determine any actual details about a Wave container after it's been built by Nextflow, because all queries appear to require the build_id, which Nextflow does not appear to be saving. It clearly should have the build_id at runtime, because the build_id is part of the Wave API response when it accepts requests to build the containers.

So if Nextflow is not saving the URIs of the public Wave containers it generates, is not saving the build_id for these containers, and is not tracking the URIs of the copies it deposits in our ECR, it seems like it should do that so we can tell which containers in our ECR are which and where they originated from.

Thanks!

@stevekm
Contributor Author

stevekm commented Feb 18, 2025

@ewels

@pditommaso
Member

pditommaso commented Feb 19, 2025

You are right that currently there isn't a simple way to collect Wave build logs for a Nextflow execution.

A possible workaround using some Bash wrangling is the following:

  1. Determine the list of containers in your run using the log command, for example:
» nextflow log last -f container
wave.seqera.io/wt/646b5f55f72c/wave/build:salmon-1.10.3--1fc239026070d9d3
wave.seqera.io/wt/cd47a5157aac/wave/build:fastqc-0.12.1--662899d0b7bd9e15
wave.seqera.io/wt/646b5f55f72c/wave/build:salmon-1.10.3--1fc239026070d9d3
wave.seqera.io/wt/cde2c4cafb86/wave/build:multiqc-1.25--008d883aee47757c
  2. Then, using the Wave request ID in the container name (the component after wt/), you can determine the buildId.

  3. Finally, with the buildId, download the build logs (the scan logs can be fetched in the same manner).
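
Step 1 and the extraction part of step 2 can be sketched in Bash. This is a minimal sketch, assuming every Wave container URI has the form wave.seqera.io/wt/<requestId>/... as in the listing above; the function name wave_request_ids is hypothetical and not part of Nextflow or Wave.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read container URIs on stdin (one per line) and print
# the unique Wave request IDs, i.e. the path component right after "wt/".
# Lines that are not Wave URIs of that shape are ignored.
wave_request_ids() {
  sed -n 's|^wave\.seqera\.io/wt/\([^/]*\)/.*$|\1|p' | sort -u
}

# Usage (assumes it is run from the pipeline launch directory):
#   nextflow log last -f container | wave_request_ids
```

Deduplicating with sort -u avoids querying the same request twice when several tasks share a container, as salmon does in the listing above.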

I'm including a Bash snippet for your convenience.

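requestId=$1
# Look up the build ID for this Wave request (-s silences curl's progress meter)
buildId=$(curl -s "https://wave.seqera.io/v1alpha2/container/${requestId}" | jq -r .buildId)
# Download the build log for that build ID
curl -s "https://wave.seqera.io/v1alpha1/builds/${buildId}/logs" > "wave-${buildId}.log"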
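
To cover a whole run, the snippet above could be wrapped in a loop that also records a small manifest for provenance. This is a sketch only: it assumes the same two endpoints and the buildId response field used in the snippet, and that curl and jq are installed. The function names, the WAVE_URL variable, and the wave-builds.tsv file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical base URL variable; the default matches the endpoints used above.
WAVE_URL="${WAVE_URL:-https://wave.seqera.io}"

# Print the build ID reported for one Wave request ID.
wave_build_id() {
  curl -fsSL "${WAVE_URL}/v1alpha2/container/$1" | jq -r '.buildId'
}

# Read request IDs on stdin; save each build log as wave-<buildId>.log and
# append "requestId<TAB>buildId" to a manifest file (default: wave-builds.tsv).
wave_collect_logs() {
  local manifest="${1:-wave-builds.tsv}" request_id build_id
  while read -r request_id; do
    [ -n "$request_id" ] || continue
    build_id=$(wave_build_id "$request_id") || continue
    # Skip when no build ID came back (jq prints "null" for a missing field).
    { [ -n "$build_id" ] && [ "$build_id" != "null" ]; } || continue
    curl -fsSL "${WAVE_URL}/v1alpha1/builds/${build_id}/logs" > "wave-${build_id}.log"
    printf '%s\t%s\n' "$request_id" "$build_id" >> "$manifest"
  done
}

# Usage: feed it one request ID per line, for example
#   printf '%s\n' 646b5f55f72c cd47a5157aac | wave_collect_logs
```

The manifest only links request IDs to build IDs; matching those to the tags pushed into a private registry is not covered by this sketch.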

That said, to improve this experience, the plan is to collect all container metadata in Seqera Platform, both for workflows launched using the Nextflow CLI and for those launched via Platform itself.

This will make it possible to collect and search for all containers, builds and vulnerability scans in a single place for all workflow runs, using either the Platform UI or the command line tool.

There's already a PR open on Nextflow to bring this metadata to Seqera Platform.

Let us know if this solution works for you.

@stevekm
Contributor Author

stevekm commented Feb 24, 2025

Thanks for this great reply. I was not aware that the part of the Wave URI after wt/ was a request ID which could be used to get the build ID. I will test this method out and let you know if it works.

I do still have concerns about making Seqera Platform the solution for this instead of Nextflow itself, because there are going to be users who don't have or want access to Platform but who still need this feature to satisfy, for example, Infosec requirements. So even if this is "solved" in Platform, it's not actually a fix for Nextflow, and it risks putting non-Platform users in a non-compliant state with company policies. It's also not really feasible to police users' command-line invocations of Nextflow to make sure they include the -with-tower CLI args, which I suspect will not even work if they do not have a Platform account, which they cannot get until they go through an approval process. So basically, "solving" this in Platform only essentially blocks large numbers of users from being able to use Nextflow at all in a way that complies with requirements.
