Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

♻️Ensure parent project/node is well structured in the DB 🗃️ #5874

Merged
merged 62 commits into from
Jun 3, 2024

Conversation

sanderegg
Copy link
Member

@sanderegg sanderegg commented May 27, 2024

What do these changes do?

This is the first part toward structuring node parenting.

Context

In Sim4life.io currently the API client PATCHES the created computational jobs with some free form JSON metadata. Among these is a node_id field which contains the parent Node (e.g. the node where s4l is running). The DV-2 uses that information to define the parent project/node by looking for the presence of that field. This is currently not defined anywhere and could break very easily.

Goal

Bring structure to this parenting, and allow other products to use this as easy as possible.

Idea

The oSparc API client shall automagically find out when it is running from an oSparc node by checking for some pre-defined ENV variables. For other free form API client, the usage relies on the respective authors responsibility (e.g. sim4life).

Constraint

Keep backward compatibility

Changes

  • Project creation POST /projects now accept a couple of headers:
    • X-Simcore-Parent-Project-Uuid which optionally contains the parent project UUID
    • X-Simcore-Parent-Node-Id which optionally contains the parent node ID
    • if both are given then the parenting structure will be setup in the projects_metadata table and used as before
    • if none are given or are both null then nothing happens
    • if they are invalid an error will be returned (404 - NotFound)
    • if only one of them is filled then an error will be returned (422 - Unprocessable)
  • to keep backward compatibility, that means the the project was created without parents,
  • and PATCH /projects/{uuid}/metadata is called with custom_metadata containing a node_id field:
    • This will set the parenting in case the passed node_id is found,
    • and that there is not already parenting set beforehand

The next steps will be re-arranging the DV-2 to directly get the information regarding parent project/node from the DB instead of extracting the information from the metadata, and taking care of the oSparc client (@bisgaard-itis )

Related issue/s

How to test

Dev-ops checklist

@sanderegg sanderegg added a:webserver issue related to the webserver service a:director-v2 issue related with the director-v2 service a:dask-service Any of the dask services: dask-scheduler/sidecar or worker labels May 27, 2024
@sanderegg sanderegg added this to the Leeroy Jenkins milestone May 27, 2024
@sanderegg sanderegg self-assigned this May 27, 2024
Copy link

codecov bot commented May 27, 2024

Codecov Report

Attention: Patch coverage is 94.42060% with 13 lines in your changes are missing coverage. Please review.

Project coverage is 87.7%. Comparing base (cafbf96) to head (048f90b).
Report is 242 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #5874      +/-   ##
=========================================
+ Coverage    84.5%   87.7%    +3.2%     
=========================================
  Files          10    1398    +1388     
  Lines         214   57580   +57366     
  Branches       25    1330    +1305     
=========================================
+ Hits          181   50554   +50373     
- Misses         23    6743    +6720     
- Partials       10     283     +273     
Flag Coverage Δ
integrationtests 64.7% <48.1%> (?)
unittests 85.8% <94.4%> (+1.2%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...core_postgres_database/models/projects_metadata.py 100.0% <100.0%> (ø)
...mcore_postgres_database/utils_projects_metadata.py 100.0% <100.0%> (ø)
.../simcore_postgres_database/utils_projects_nodes.py 90.0% <100.0%> (ø)
...rary/src/servicelib/aiohttp/requests_validation.py 87.5% <100.0%> (ø)
...s/service-library/src/servicelib/common_headers.py 100.0% <100.0%> (ø)
...ore_service_webserver/projects/_crud_api_create.py 98.3% <100.0%> (ø)
...mcore_service_webserver/projects/_crud_handlers.py 91.3% <100.0%> (ø)
...ervice_webserver/projects/_crud_handlers_models.py 98.0% <100.0%> (ø)
...imcore_service_webserver/projects/_metadata_api.py 100.0% <100.0%> (ø)
...c/simcore_service_webserver/projects/exceptions.py 93.4% <100.0%> (ø)
... and 5 more

... and 1353 files with indirect coverage changes

@sanderegg sanderegg changed the title ♻️Ensure parent project/node is well structured in the DB ♻️Ensure parent project/node is well structured in the DB 🗃️ May 27, 2024
@sanderegg sanderegg marked this pull request as ready for review May 27, 2024 17:00
Copy link
Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, thanks! Let's just discuss the project_node_id. I think then you would not need to deal with the ProjectNodesNonUniqueNodeFoundError case.

Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx.
This PR extends projects_metadata tables with structured metadata parent_project_uuid and parent_node_id.
Some of these values where already injected as custom metadata by S4L and compatibilyt for that case is supported.

I wonder whether this parenthood relation might also be used to map a study_id with its job_id in the new study resource in the api-server ...

Copy link
Contributor

@GitHK GitHK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, I have some questions, since I could not figure out a few things.

Copy link
Contributor

@bisgaard-itis bisgaard-itis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool. Thanks 🚀

I am wondering if it would make sense to put this new data into a new table because I think we will want to store much more of this type of relational data soon. But can of course also be handled later.

@matusdrobuliak66
Copy link
Contributor

matusdrobuliak66 commented May 29, 2024

🚨 Q: @sanderegg @bisgaard-itis @pcrespov
I just realized that for grouping purposes, we should have a concept of a "root parent." The user wants to see the usage and pricing of their project, regardless of the nested structure of projects (which is an implementation detail on our side). To get and group this data efficiently, we need to store the root parent IDs. What do you think?

@bisgaard-itis
Copy link
Contributor

🚨 Q: @sanderegg @bisgaard-itis @pcrespov I just realized that for grouping purposes, we should have a concept of a "root parent." The user wants to see the usage and pricing of their project, regardless of the nested structure of projects (which is an implementation detail on our side). To get and group this data efficiently, we need to store the root parent IDs. What do you think?

Couldn't this be resolved by following the parent_ids and adding up the costs along the way? Or is this operation too costly?

@matusdrobuliak66
Copy link
Contributor

matusdrobuliak66 commented May 29, 2024

🚨 Q: @sanderegg @bisgaard-itis @pcrespov I just realized that for grouping purposes, we should have a concept of a "root parent." The user wants to see the usage and pricing of their project, regardless of the nested structure of projects (which is an implementation detail on our side). To get and group this data efficiently, we need to store the root parent IDs. What do you think?

Couldn't this be resolved by following the parent_ids and adding up the costs along the way? Or is this operation too costly?

Yes, for frontend listing, filtering, and CSV export, this operation needs to be fast. Also, I cannot imagine what the query would look like, you would need to have some kind of loop with a lot of joining, which is already a red flag for me. Basically, we need to keep the root project ID somewhere.

@bisgaard-itis
Copy link
Contributor

🚨 Q: @sanderegg @bisgaard-itis @pcrespov I just realized that for grouping purposes, we should have a concept of a "root parent." The user wants to see the usage and pricing of their project, regardless of the nested structure of projects (which is an implementation detail on our side). To get and group this data efficiently, we need to store the root parent IDs. What do you think?

Couldn't this be resolved by following the parent_ids and adding up the costs along the way? Or is this operation too costly?

Yes, for frontend listing, filtering, and CSV export, this operation needs to be fast. Also, I cannot imagine what the query would look like, you would need to have some kind of loop with a lot of joining, which is already a red flag for me. Basically, we need to keep the root project ID somewhere.

What I imagine is something like the following: You first extract the projects which are "root projects" (meaning projects which don't have parents I guess). And then in a loop you look for their children, grandchildren, grandgrandchildren etc. So I guess the question is if that loop would be on the server or in the client. If the loop is on the client side, then the individual queries to the db would be quite fast I think 🤔 (getting all children of a given project_id). But of course if the loop is on the client side then it might not be super fast to resolve the entire thing.

@matusdrobuliak66
Copy link
Contributor

matusdrobuliak66 commented May 29, 2024

🚨 Q: @sanderegg @bisgaard-itis @pcrespov I just realized that for grouping purposes, we should have a concept of a "root parent." The user wants to see the usage and pricing of their project, regardless of the nested structure of projects (which is an implementation detail on our side). To get and group this data efficiently, we need to store the root parent IDs. What do you think?

Couldn't this be resolved by following the parent_ids and adding up the costs along the way? Or is this operation too costly?

Yes, for frontend listing, filtering, and CSV export, this operation needs to be fast. Also, I cannot imagine what the query would look like, you would need to have some kind of loop with a lot of joining, which is already a red flag for me. Basically, we need to keep the root project ID somewhere.

What I imagine is something like the following: You first extract the projects which are "root projects" (meaning projects which don't have parents I guess). And then in a loop you look for their children, grandchildren, grandgrandchildren etc. So I guess the question is if that loop would be on the server or in the client. If the loop is on the client side, then the individual queries to the db would be quite fast I think 🤔 (getting all children of a given project_id). But of course if the loop is on the client side then it might not be super fast to resolve the entire thing.

Not sure whether we are on the same page. I am talking about this view:
image
which needs to be paginated. Users should be able to sort and filter it based on their needs, and it should also be exportable as a CSV file. For efficiency, these requirements should ideally be satisfied with a single query to the database, mainly for pagination purposes. Storing root parent id somewhere totally solves the problem. Do not forget you are not doing this operation for one specific project, but across all that you are listing.

I would propose for example having it as additional column in the projects_metadata table:

Project ID Parent Project ID Root Parent Project ID
A NULL NULL
B A A
C B A

We will just ask for root parent project id of a parent project id when inserting a new row into this table and that's it. Lets discuss this @sanderegg. Thanks.

@sanderegg
Copy link
Member Author

🚨 Q: @sanderegg @bisgaard-itis @pcrespov I just realized that for grouping purposes, we should have a concept of a "root parent." The user wants to see the usage and pricing of their project, regardless of the nested structure of projects (which is an implementation detail on our side). To get and group this data efficiently, we need to store the root parent IDs. What do you think?

Couldn't this be resolved by following the parent_ids and adding up the costs along the way? Or is this operation too costly?

Yes, for frontend listing, filtering, and CSV export, this operation needs to be fast. Also, I cannot imagine what the query would look like, you would need to have some kind of loop with a lot of joining, which is already a red flag for me. Basically, we need to keep the root project ID somewhere.

What I imagine is something like the following: You first extract the projects which are "root projects" (meaning projects which don't have parents I guess). And then in a loop you look for their children, grandchildren, grandgrandchildren etc. So I guess the question is if that loop would be on the server or in the client. If the loop is on the client side, then the individual queries to the db would be quite fast I think 🤔 (getting all children of a given project_id). But of course if the loop is on the client side then it might not be super fast to resolve the entire thing.

Not sure whether we are on the same page. I am talking about this view: image which needs to be paginated. Users should be able to sort and filter it based on their needs, and it should also be exportable as a CSV file. For efficiency, these requirements should ideally be satisfied with a single query to the database, mainly for pagination purposes. Storing root parent id somewhere totally solves the problem. Do not forget you are not doing this operation for one specific project, but across all that you are listing.

I would propose for example having it as additional column in the projects_metadata table:

Project ID Parent Project ID Root Parent Project ID
A NULL NULL
B A A
C B A
We will just ask for root parent project id of a parent project id when inserting a new row into this table and that's it. Lets discuss this @sanderegg. Thanks.

for completeness of the discussion, I agreed to add the root project ID/root project Node ID

@bisgaard-itis
Copy link
Contributor

🚨 Q: @sanderegg @bisgaard-itis @pcrespov I just realized that for grouping purposes, we should have a concept of a "root parent." The user wants to see the usage and pricing of their project, regardless of the nested structure of projects (which is an implementation detail on our side). To get and group this data efficiently, we need to store the root parent IDs. What do you think?

Couldn't this be resolved by following the parent_ids and adding up the costs along the way? Or is this operation too costly?

Yes, for frontend listing, filtering, and CSV export, this operation needs to be fast. Also, I cannot imagine what the query would look like, you would need to have some kind of loop with a lot of joining, which is already a red flag for me. Basically, we need to keep the root project ID somewhere.

What I imagine is something like the following: You first extract the projects which are "root projects" (meaning projects which don't have parents I guess). And then in a loop you look for their children, grandchildren, grandgrandchildren etc. So I guess the question is if that loop would be on the server or in the client. If the loop is on the client side, then the individual queries to the db would be quite fast I think 🤔 (getting all children of a given project_id). But of course if the loop is on the client side then it might not be super fast to resolve the entire thing.

Not sure whether we are on the same page. I am talking about this view: image which needs to be paginated. Users should be able to sort and filter it based on their needs, and it should also be exportable as a CSV file. For efficiency, these requirements should ideally be satisfied with a single query to the database, mainly for pagination purposes. Storing root parent id somewhere totally solves the problem. Do not forget you are not doing this operation for one specific project, but across all that you are listing.
I would propose for example having it as additional column in the projects_metadata table:
Project ID Parent Project ID Root Parent Project ID
A NULL NULL
B A A
C B A
We will just ask for root parent project id of a parent project id when inserting a new row into this table and that's it. Lets discuss this @sanderegg. Thanks.

for completeness of the discussion, I agreed to add the root project ID/root project Node ID

Sounds good to me. Thanks

@sanderegg sanderegg force-pushed the tipv3/add-parent-node-id branch from 77cd689 to 0047e64 Compare May 30, 2024 17:18
@sanderegg sanderegg force-pushed the tipv3/add-parent-node-id branch from c8d096a to b507a77 Compare May 31, 2024 15:08
@sanderegg sanderegg requested review from GitHK and pcrespov May 31, 2024 16:03
Copy link
Contributor

@bisgaard-itis bisgaard-itis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment I noticed when propagating these changes to the api-server

Copy link

sonarqubecloud bot commented Jun 3, 2024

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg merged commit b77814a into ITISFoundation:master Jun 3, 2024
57 checks passed
@sanderegg sanderegg deleted the tipv3/add-parent-node-id branch June 3, 2024 08:10
Copy link
Contributor

@GitHK GitHK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some minor things

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:dask-service Any of the dask services: dask-scheduler/sidecar or worker a:director-v2 issue related with the director-v2 service a:webserver issue related to the webserver service
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants