Script for seeding category data from S3 #959

Merged
codemonkey800 merged 29 commits into main from jeremy/seed-categories-script
May 10, 2023

Conversation

Collaborator

@codemonkey800 codemonkey800 commented Mar 22, 2023

Description

#866

Depends on #986

This PR adds a script for seeding category data from the S3 bucket. It does so by adding a general-purpose run_workflow.py script that can either be run from the CLI to execute a particular workflow, or imported so the run_workflow() function can be called directly. This lets us run the same workflow scripts locally and from the data-workflows lambda.

To seed category data from S3, run the run_workflow.py script:

AWS_PROFILE=sci-imaging \
  BUCKET=<bucket> \
  STACK_NAME=<stack-name> \
  python run_workflow.py seed-s3-categories \
  --edam-version EDAM-BIOIMAGING:alpha06 \
  --s3_path category/EDAM-BIOIMAGING/alpha06.json

For example, this is the command I used for testing:

AWS_PROFILE=sci-imaging \
  BUCKET=napari-hub-dev \
  STACK_NAME=dev-test-cat \
  python run_workflow.py seed-s3-categories \
  --edam-version EDAM-BIOIMAGING:alpha06 \
  --s3_path category/EDAM-BIOIMAGING/alpha06.json

The workflow can also be run from Python using:

from run_workflow import run_workflow

run_workflow({
    'type': 'seed-s3-categories',
    'edam_version': 'EDAM-BIOIMAGING:alpha06',
    's3_path': 'category/EDAM-BIOIMAGING/alpha06.json',
})
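
For readers who want a feel for how one entry point can serve both the CLI and Python callers, here is a minimal sketch, assuming an argparse-based CLI. Only run_workflow, the event keys, the flag names, and the "Missing ..." log messages come from this PR; seed_s3_categories, build_event, the "Unknown workflow" branch, and the handler body are hypothetical stand-ins and not the actual implementation.

```python
import argparse
import logging
from typing import Dict, List, Optional

LOGGER = logging.getLogger(__name__)


def seed_s3_categories(edam_version: str, s3_path: str) -> str:
    # Hypothetical stand-in: the real workflow would read the file from S3
    # and write category rows to Dynamo. Here it only reports its inputs.
    return f"seeded {edam_version} from {s3_path}"


def run_workflow(event: Dict) -> Optional[str]:
    """Dispatch on event['type']; log and return None if the event is invalid."""
    event_type = event.get("type")

    if event_type == "seed-s3-categories":
        version = event.get("edam_version")
        s3_path = event.get("s3_path")

        if not version:
            LOGGER.error(f"Missing 'version' for type={event_type}")
            return None

        if not s3_path:
            LOGGER.error(f"Missing 's3_path' for type={event_type}")
            return None

        return seed_s3_categories(version, s3_path)

    # Assumption: unknown types are logged and ignored.
    LOGGER.error(f"Unknown workflow type={event_type}")
    return None


def build_event(argv: List[str]) -> Dict:
    """Turn CLI arguments into the same event dict that Python callers pass."""
    parser = argparse.ArgumentParser(prog="run_workflow.py")
    subparsers = parser.add_subparsers(dest="type", required=True)

    seed = subparsers.add_parser("seed-s3-categories")
    seed.add_argument("--edam-version", dest="edam_version", required=True)
    seed.add_argument("--s3_path", dest="s3_path", required=True)

    return vars(parser.parse_args(argv))
```

Under these assumptions, a CLI entry point would only need to call run_workflow(build_event(sys.argv[1:])), so the lambda and a local shell both end up in the same function with the same event shape.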

Demos

List of first 50 categories in Dynamo

image

@codemonkey800 codemonkey800 added the maintenance Release Label: Used for categorizing maintenance PRs in automated CI release notes label Mar 22, 2023
Collaborator

@manasaV3 manasaV3 left a comment

Really cool stuff.. 🎉

return round(time.time() * 1000)


class S3Client:
Collaborator

Having this be a class is really clean. 😁

@codemonkey800 codemonkey800 force-pushed the jeremy/seed-categories-script branch from 8fa36c7 to 542c6fd Compare March 29, 2023 02:01
@codemonkey800 codemonkey800 requested a review from manasaV3 March 29, 2023 02:01
klai95 previously approved these changes Mar 30, 2023
import os


def get_required_env(key: str) -> str:
Collaborator

Really cool idea. 😀
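
Only the signature of get_required_env is visible in this thread, so the following is a sketch of one plausible body, not the PR's code. It assumes the helper should fail loudly when a variable such as BUCKET or STACK_NAME is unset or empty; the choice of ValueError and the message text are assumptions.

```python
import os


def get_required_env(key: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(key)

    if not value:
        # Assumption: raising is one reasonable way to fail fast; the actual
        # behaviour in the PR is not shown in this conversation.
        raise ValueError(f"Required environment variable '{key}' is not set")

    return value
```

Failing at lookup time keeps a missing BUCKET or STACK_NAME from surfacing later as a harder-to-read AWS error.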

from typing import Dict


def run_workflow(event: Dict):
Contributor

Nicely done! Having this workflow is super helpful for the future!

klai95 previously approved these changes Apr 4, 2023
@codemonkey800 codemonkey800 force-pushed the jeremy/seed-categories-script branch from 22ecad6 to abc4af6 Compare April 6, 2023 22:17
@codemonkey800 codemonkey800 changed the base branch from main to plugin-etl-base April 6, 2023 22:34
Comment on lines 30 to 42
if not version:
    LOGGER.error(f"Missing 'version' for type={event_type}")
    return

if not s3_path:
    LOGGER.error(f"Missing 's3_path' for type={event_type}")
    return
Collaborator

🔥

@codemonkey800 codemonkey800 force-pushed the jeremy/seed-categories-script branch from abc4af6 to 56e863b Compare April 8, 2023 01:45
@codemonkey800 codemonkey800 changed the base branch from plugin-etl-base to main April 8, 2023 01:46
@codemonkey800 codemonkey800 requested review from manasaV3 and klai95 April 8, 2023 01:46
Collaborator

manasaV3 commented Apr 11, 2023

BLOCKER: I believe we were holding off on incorporating changes from #986 until Kevin's ETL PR was merged, to avoid conflicts that might be harder to resolve on his end.

@codemonkey800 codemonkey800 force-pushed the jeremy/seed-categories-script branch from 4bfce5a to fbdb3e9 Compare May 5, 2023 15:25
@codemonkey800 codemonkey800 requested review from klai95 and manasaV3 May 5, 2023 15:30
@codemonkey800 codemonkey800 dismissed manasaV3’s stale review May 5, 2023 15:31

feedback addressed

Collaborator

@manasaV3 manasaV3 left a comment

I believe this is missing unit tests. 😅

manasaV3 previously approved these changes May 8, 2023
Collaborator

@manasaV3 manasaV3 left a comment

Functionality LGTM, but this might require unit test coverage.

@codemonkey800 codemonkey800 added new-feature Release Label: Used for categorizing features in automated release notes and removed maintenance Release Label: Used for categorizing maintenance PRs in automated CI release notes labels May 8, 2023
Collaborator

@manasaV3 manasaV3 left a comment

LGTM

@codemonkey800 codemonkey800 merged commit dcee8e1 into main May 10, 2023
@codemonkey800 codemonkey800 deleted the jeremy/seed-categories-script branch May 10, 2023 14:41
Labels
new-feature Release Label: Used for categorizing features in automated release notes

4 participants