
Invoke plugins lambda from backend for npe2 manifest discovery #673

Merged
DragaDoncila merged 44 commits into main from dev-npe2-triggers on Oct 14, 2022

Conversation

DragaDoncila
Contributor

@DragaDoncila DragaDoncila commented Oct 3, 2022

As described in #620, this PR updates the manifest API endpoint handler to directly invoke the plugins lambda when the manifest file for a plugin & version is not already cached.

Check out the metadata for napari-blossom on the frontend:
https://dev-npe2-triggers-frontend.dev.imaging.cziscience.com/plugins/napari-blossom
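For context, a minimal sketch of that cache-miss path; the function name, payload shape, and the read_manifest_from_s3_cache helper are illustrative, not the actual handler code:

```python
import json

import boto3

lambda_client = boto3.client("lambda")

def get_manifest(plugin: str, version: str) -> dict:
    """Return the cached manifest, invoking the plugins lambda on a cache miss."""
    cached = read_manifest_from_s3_cache(plugin, version)  # hypothetical helper
    if cached is not None:
        return cached
    # Fire the plugins lambda asynchronously; discovery writes the manifest
    # back to the S3 cache, so a later request will find it there.
    lambda_client.invoke(
        FunctionName="plugins",  # function name assumed
        InvocationType="Event",  # async invocation
        Payload=json.dumps({"plugin": plugin, "version": version}),
    )
    return {"error": "Manifest not yet processed."}
```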

@DragaDoncila
Contributor Author

I've now let this branch run on all plugins in dev-npe2-triggers for the first time. Just wanted to leave a few notes here:

  • The initial CloudWatch update triggered 235 lambda invocations - roughly consistent with the number of currently listed plugins (232; I have no idea where the extra 3 come from...).
  • Of the 235 invocations, 13 errored. All errors were timeout/max-memory errors from the lambda. This was semi-expected, as I had just lowered the resource allocations to their minimums for testing purposes. None of the failing plugins provides a wheel distribution, so the fetch process has to build one (hence the increased resource requirements).
  • Even with the 13 invocations timing out at 60 seconds, the average duration across the initial run of 235 invocations was 4.3s.
  • After ~25 minutes, all but 19 plugins had complete metadata in index.json - roughly consistent with the 13 errored lambdas, though I will look into the other 6.

The frustrating thing is that despite the lambda being configured not to retry, 12 of the 13 initially failing invocations just kept being re-invoked - more than 6 times each! I can't explain that yet; they just keep going.

I will upgrade the memory allocation and see how that affects the remaining failing invocations, and investigate the 6-plugin discrepancy in index.json.

@DragaDoncila
Contributor Author

Increased the memory size to 256MB, and some plugins are still timing out (memory usage is now reported at 200MB). Will update the timeout to 2 min.
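For reference, the same resource bump can be applied ad hoc with boto3 (the function name here is assumed):

```python
import boto3

client = boto3.client("lambda")
# Bump the plugins lambda's resources (function name assumed).
client.update_function_configuration(
    FunctionName="plugins",
    MemorySize=256,  # MB
    Timeout=120,     # seconds
)
```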

@DragaDoncila
Contributor Author

OK, after those updates all lambdas completed successfully. We still have 12 plugins with no manifest metadata in index.json; I've summarized them below:

Error message written to manifest

These might need to be retried now that I've updated the memory size.

  • napari-yolov5 - no space left on device (lambda memory error)
  • platelet-unet-watershed - [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/var/task']
  • zarpaint - [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/var/task']
  • nfinder - [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/var/task']

Missing in npe2api - legitimate errors

  • skeleton-finder
  • napari-mri

Actually 'empty' manifests

  • napari-error-reporter - napari package, not a plugin
  • napari-console - napari package, not a plugin
  • beetlesafari - accurate, doesn't return any hook implementations
  • devbio-napari - accurate, napari tools menu stuff

Potential fetch error

  • napari-mat-images - does have a hook_implementation, but fetch finds no contributions
  • napari-features - does have a hook_implementation, but fetch finds no contributions

@richaagarwal
Collaborator

@DragaDoncila Thanks so much for these updates! Regarding the 12 plugins that failed / have no manifest data: did the retries (even though they're not supposed to happen at all) eventually settle at around 6, or are they still being retried?

Also - let me know when this PR is ready for review!

@DragaDoncila
Contributor Author

Regarding the 12 plugins that failed / have no manifest data: did the retries (even though they're not supposed to happen at all) eventually settle at around 6, or are they still being retried?

@richaagarwal just summarizing here from our discussion today. All lambdas ceased to be retried after I updated the resource allocations, since there were no longer any external lambda errors (e.g. timeouts or memory issues). The retries weren't default AWS behavior: because our manifest files were never being written, our CloudWatch cron job kept leading to the lambda being invoked.

As discussed, this is not great, as we're relying on file existence as a measure of idempotency. I've now updated the lambda handler to write an empty file at the beginning of discovery, and to overwrite it with either an error message or the real manifest once discovery is complete. This means even an unsuccessful lambda termination doesn't lead to a retry, as the file will exist. I've also updated the error messages returned from the API to reflect this.
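A minimal sketch of that write-first pattern, reusing the bucket and prefix from the cleanup command below; the key layout and the fetch_manifest helper are assumptions:

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "napari-hub-dev"
PREFIX = "dev-npe2-triggers/cache"

def handle(event, context):
    plugin, version = event["plugin"], event["version"]
    key = f"{PREFIX}/{plugin}/{version}-manifest.json"  # key layout assumed
    # Write an empty placeholder immediately: the file's existence is what
    # stops the CloudWatch cron from re-invoking discovery for this plugin.
    s3.put_object(Bucket=BUCKET, Key=key, Body=b"{}")
    try:
        manifest = fetch_manifest(plugin, version)  # hypothetical npe2 fetch
        body = json.dumps(manifest)
    except Exception as e:
        body = json.dumps({"error": str(e)})
    # Overwrite the placeholder with the real manifest or an error message.
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode())
```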

As a future piece of work, if we don't like that we're overwriting the file, we can update the Python handler to time its own execution and write an error message just before the lambda would time out.
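If we do go that route, the Lambda runtime already exposes the remaining time budget, so a watchdog-style sketch could look like this (the 5-second buffer and the helper names are illustrative):

```python
import threading

def handle(event, context):
    # Schedule an error write shortly before the runtime would kill us, so
    # even a timeout leaves a non-empty manifest file behind.
    budget = context.get_remaining_time_in_millis() / 1000 - 5  # 5s buffer
    timer = threading.Timer(budget, write_timeout_error_manifest, args=(event,))  # hypothetical helper
    timer.start()
    try:
        run_discovery(event)  # hypothetical discovery entry point
    finally:
        timer.cancel()  # discovery finished (or raised) in time
```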

Finally, we discussed cleaning the prod and staging environments of manifest files placed there by the original system. We decided it would be fine to manually run a command to clear these out of staging just before we merge this PR, and out of prod soon after; this will let the new method populate all manifests correctly.

For people's reference, the command to print all files that would be deleted is:

aws s3 rm s3://napari-hub-dev/dev-npe2-triggers/cache/ --recursive --exclude "*" --include "*-manifest.json" --dryrun
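(Dropping --dryrun performs the deletion for real; the --exclude "*" --include "*-manifest.json" pair limits the recursive delete to manifest files only.)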

I've run this cleanup in the dev branch bucket so I could watch another run of invocations over the full plugin list. This time a single plugin timed out (napari-yolov5) while attempting to build its wheel, and I've confirmed that the plugins lambda doesn't get invoked again. I've updated the timeout to 2.5 min (a subsequent run for this plugin almost finished), and hopefully that will take care of this last plugin.

Regardless, I think this PR is ready for review!

@DragaDoncila DragaDoncila marked this pull request as ready for review October 12, 2022 07:56
@DragaDoncila
Contributor Author

Small update: napari-yolov5 finished successfully in 124s - so close 😆

@DragaDoncila DragaDoncila changed the title [WIP] Invoke plugins lambda from backend for npe2 manifest discovery Invoke plugins lambda from backend for npe2 manifest discovery Oct 12, 2022
@richaagarwal
Collaborator

richaagarwal commented Oct 12, 2022

@DragaDoncila I left some comments, but after reviewing further, I realized there are a few things about the previous structure we had in place that might be problematic with this new code.

As the PR currently stands, the plugins lambda could theoretically be invoked outside of the cron job, which isn't ideal. We'd ideally rule that out by separating calls to the S3 cache from calls to get_manifest, since get_manifest now directly invokes the plugins lambda.

It also looks like get_frontend_manifest_metadata will need to be updated regardless since it's still referencing the old process_count.

@DragaDoncila
Contributor Author

@richaagarwal thanks for the thorough review! I've updated the logic now so that get_manifest just retrieves the manifest from cache. I've added a new method discover_manifest which is only called from build_manifest_metadata when get_manifest returns a "Manifest not yet processed" message.

I've verified that the plugins lambda is no longer invoked when, for example, we make a call to manifest/{name} in the API, but that it is successfully invoked during the cron job.
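For posterity, a rough sketch of the resulting split; everything other than the get_manifest / discover_manifest / build_manifest_metadata names is assumed:

```python
def get_manifest(plugin: str, version: str) -> dict:
    """Read-only: fetch the manifest from the S3 cache; never invokes the lambda."""
    cached = read_manifest_from_s3_cache(plugin, version)  # hypothetical helper
    if cached is None:
        return {"error": "Manifest not yet processed."}
    return cached

def discover_manifest(plugin: str, version: str) -> None:
    """Invoke the plugins lambda asynchronously to run npe2 discovery."""
    invoke_plugins_lambda(plugin, version)  # hypothetical boto3 wrapper

def build_manifest_metadata(plugin: str, version: str) -> dict:
    # Called from the cron job. API calls to manifest/{name} hit get_manifest
    # directly, so they can never trigger a lambda invocation.
    manifest = get_manifest(plugin, version)
    if manifest.get("error") == "Manifest not yet processed.":
        discover_manifest(plugin, version)
    return manifest
```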

@richaagarwal
Collaborator

This looks great to me! Thanks so much for your thorough work & testing on this. The code comments really help with readability too 💛

I wonder if we should also add staging to the feature flags here: https://github.com/chanzuckerberg/napari-hub/blob/main/frontend/src/utils/featureFlags.ts#L27. I'll leave it up to you whether to add that before merging, or if you'd prefer to open a separate PR for it.

@DragaDoncila
Contributor Author

@richaagarwal I'll update the feature flag here to enable on staging. Before I merge I'll also delete the -manifest.json files on staging.

@DragaDoncila
Contributor Author

Synced with @richaagarwal about the failing test, and as it seems unrelated, I will go ahead and merge this now!

@DragaDoncila DragaDoncila merged commit 8143eff into main Oct 14, 2022
@DragaDoncila DragaDoncila deleted the dev-npe2-triggers branch October 14, 2022 20:42