feat: support reloading workflows at runtime #3180

didier-wenzek · 2024-10-09T16:57:20Z

Proposed changes

To support dynamic reloading of workflows without breaking a running workflow,
the proposal is to:

use a hash of the workflow definition file to distinguish workflow versions
persist a copy of each version for a given operation
make the copy when a command instance is triggered
use reference counting to remove copies when they are no longer in use.
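
The four points above can be sketched as follows. This is a minimal illustration, not the code of this PR: `InUseCopies`, `acquire`, `release` and `version_of` are hypothetical names, the copies are kept in memory rather than on disk, and the standard library `DefaultHasher` stands in for the hash of the definition file only to keep the example free of dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Version of a workflow definition, derived from the file content.
/// Assumption: DefaultHasher is only a stand-in for a real content digest.
fn version_of(definition: &str) -> String {
    let mut hasher = DefaultHasher::new();
    definition.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hypothetical store of the workflow copies currently in use,
/// keyed by (operation, version), with a reference count per copy.
#[derive(Default)]
struct InUseCopies {
    copies: HashMap<(String, String), (String, usize)>,
}

impl InUseCopies {
    /// Called when a command instance is triggered: keep a copy of the
    /// definition it starts with and return the version it is pinned to.
    fn acquire(&mut self, operation: &str, definition: &str) -> String {
        let version = version_of(definition);
        let entry = self
            .copies
            .entry((operation.to_string(), version.clone()))
            .or_insert_with(|| (definition.to_string(), 0));
        entry.1 += 1;
        version
    }

    /// Called when a command instance terminates: drop the copy once
    /// no running instance uses that version anymore.
    fn release(&mut self, operation: &str, version: &str) {
        let key = (operation.to_string(), version.to_string());
        if let Some(entry) = self.copies.get_mut(&key) {
            entry.1 -= 1;
            if entry.1 == 0 {
                self.copies.remove(&key);
            }
        }
    }

    /// Number of distinct (operation, version) copies currently kept.
    fn len(&self) -> usize {
        self.copies.len()
    }
}

fn main() {
    let mut store = InUseCopies::default();
    // Two instances start before and after the definition file is updated.
    let old = store.acquire("user-command", "definition, first edit");
    let new = store.acquire("user-command", "definition, second edit");
    assert_ne!(old, new);
    assert_eq!(store.len(), 2);
    // The old copy is removed as soon as its last instance terminates.
    store.release("user-command", &old);
    assert_eq!(store.len(), 1);
}
```

Pinning each running instance to the version it started with is what lets a new instance pick up the latest definition while older instances finish undisturbed.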

Plan:

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#3156

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

codecov · 2024-10-09T17:09:10Z

Codecov Report

Attention: Patch coverage is 41.17647% with 340 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ore/tedge_agent/src/operation_workflows/persist.rs	34.22%	210 Missing and 11 partials ⚠️
crates/core/tedge_api/src/workflow/supervisor.rs	38.60%	93 Missing and 4 partials ⚠️
.../core/tedge_agent/src/operation_workflows/actor.rs	45.45%	17 Missing and 1 partial ⚠️
...core/tedge_agent/src/operation_workflows/config.rs	0.00%	3 Missing ⚠️
...tes/core/tedge_agent/src/state_repository/state.rs	93.33%	0 Missing and 1 partial ⚠️

github-actions · 2024-10-09T17:30:46Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %	⏱️ Duration
520	0	2	520	100	1h39m10.086467s

didier-wenzek · 2024-10-10T07:53:52Z

crates/core/tedge_agent/src/operation_workflows/persist.rs

+) -> Result<(OperationWorkflow, WorkflowVersion), anyhow::Error> {
+    let bytes = tokio::fs::read(path).await.context("Fail to read file")?;
+    let input = std::str::from_utf8(&bytes).context("Fail to extract UTF8 content")?;
+    let version = sha256::digest(input);


Using md5 should be enough here, as there are no crypto concerns.

Bravo555 · 2024-10-23T11:34:28Z

tests/RobotFramework/tests/tedge_agent/workflows/dynamic_workflow_reloading.robot

+    ...    item="@version":"76e9afe834b4a7cadc9029670ba76745fcda73784f9e78c09f0c0416f7f58ad2"
+
+Recover Builtin Operation
+    ThinEdgeIO.File Should Exist    /etc/tedge/operations/software_list.toml


suggestion: the test fails if run by itself and not part of the test suite, because this file is created by the previous test case. Could we instead use Transfer to Device everywhere so that there are no dependencies between the test cases? I'd expect Transfer to Device to overwrite the file if it already exists, so it should be okay to use that?

I have mixed feelings here. On one hand, you are correct: it would be handy to have independent tests. On the other hand, this test suite represents well a scenario where a user creates a workflow file and then iterates on it.

Concretely, replacing the File Should Exist assertion with a Transfer to Device command would lead to a different test when running the suite vs. the isolated test. Indeed, in the suite case, one checks that a user can update a workflow while, in the isolated case, one checks that the user can create a workflow (i.e. Update User-Defined Operation would do the same test as Create User-Defined Operation).

Bravo555

Some nits, but LGTM overall.

Bravo555 · 2024-10-23T11:46:40Z

tests/RobotFramework/tests/tedge_agent/workflows/dynamic_workflow_reloading.robot

+    ${workflow_log}    Execute Command    cat /var/log/tedge/agent/workflow-user-command-dyn-test-1.log
+    Should Contain
+    ...    ${workflow_log}
+    ...    item="@version":"37d0861e3038b34e8ab2ffe3257dd9372213ed5e17ba352e5028b0bf9762a089"


nit(non-blocking): the actual SHA256 value is an implementation detail; we don't need to compare the full value, only check that it changed between different versions of the workflow.

Also, if the toml file changes, this value will have to be updated.

It's a bit of a nitpick, but a comment would help because it's not obvious that this is the SHA256 hash of user-command-v1.toml, nor why we're comparing it.
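
One way to express this check, sketched in Rust rather than Robot Framework: extract the `@version` value from the log of each run and assert only that it changed. The helper `extract_version` and the sample log lines are hypothetical; the `"@version":"..."` field format is assumed from the pattern matched in the test above.

```rust
/// Return the value of the first `"@version":"..."` field found in `log`, if any.
fn extract_version(log: &str) -> Option<&str> {
    let marker = "\"@version\":\"";
    let start = log.find(marker)? + marker.len();
    let end = log[start..].find('"')? + start;
    Some(&log[start..end])
}

fn main() {
    // Hypothetical log excerpts, one per version of the workflow file.
    let log_v1 = r#"{"status":"init","@version":"aaa111"}"#;
    let log_v2 = r#"{"status":"init","@version":"bbb222"}"#;

    let v1 = extract_version(log_v1).expect("version missing in first log");
    let v2 = extract_version(log_v2).expect("version missing in second log");

    // Only the change matters, not the exact digest.
    assert_ne!(v1, v2, "updating the workflow file must change the version");
}
```

With this shape, editing the toml fixture no longer requires updating a hard-coded digest in the test.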

Bravo555 · 2024-10-23T12:34:38Z

crates/core/tedge_agent/src/operation_workflows/persist.rs

+use anyhow::Context;
+use camino::Utf8Path;
+use camino::Utf8PathBuf;
+use log::error;


Suggested change

-use log::error;
+use tracing::error;

Fixed 41f4208

Bravo555 · 2024-10-23T12:45:27Z

crates/core/tedge_agent/src/operation_workflows/persist.rs

thought: this module has quite a bit of functionality but doesn't have any unit tests - codecov reports 210 missed lines and 34.2% patch coverage (from other workflow-related tests)

I can only acknowledge that the unit test coverage is poor. However, this code is quite extensively tested by the system test suite added by this PR (Dynamic Workflow Reloading). I opted for system tests instead of unit tests because the features introduced by this module are heavily related to the file system and inotify, as well as to sequences of user actions (adding/updating/removing files while the agent is running/restarted). One place where unit tests can be improved is the tedge_api::workflow::supervisor module, which provides the in-memory representation of the uploaded workflow definitions.

…r engine The WorkflowRepository acts as a facade to WorkflowSupervisor adding all disk related features: loading definitions from disk, caching definitions in-use, reloading definitions on changes. Signed-off-by: Didier Wenzek <[email protected]>

Since a workflow source is always used with complementary info (a file path or a workflow version), it makes sense to pack that info within the WorkflowSource itself. This also highlights the corner case of the BuiltIn workflow, for which there is no complementary info. Signed-off-by: Didier Wenzek <[email protected]>
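
As an illustration of this commit message, a workflow source carrying its complementary info might look like the sketch below. Only the names WorkflowSource and BuiltIn come from the message; the other variant names, the type parameter and the `inner` helper are assumptions made for the example.

```rust
/// Where a workflow definition comes from, with its complementary info `T`
/// (for instance a file path or a workflow version).
#[derive(Debug, Clone, PartialEq)]
enum WorkflowSource<T> {
    /// Built-in workflow: the corner case with no complementary info.
    BuiltIn,
    /// Definition provided by the user (hypothetical variant name).
    UserDefined(T),
    /// Copy kept for running instances (hypothetical variant name).
    InUseCopy(T),
}

impl<T> WorkflowSource<T> {
    /// The complementary info, if this source has any.
    fn inner(&self) -> Option<&T> {
        match self {
            WorkflowSource::BuiltIn => None,
            WorkflowSource::UserDefined(info) | WorkflowSource::InUseCopy(info) => Some(info),
        }
    }
}

fn main() {
    let from_file = WorkflowSource::UserDefined("/etc/tedge/operations/software_list.toml");
    let in_use = WorkflowSource::InUseCopy("some-version");
    let builtin: WorkflowSource<&str> = WorkflowSource::BuiltIn;

    println!("{:?} {:?} {:?}", from_file.inner(), in_use.inner(), builtin.inner());
}
```

Packing the info into the variant means a caller cannot hold a user-defined source without its path or version, and the built-in case is explicit in every `match`.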

reubenmiller added the theme:workflows Theme: Workflow engine topics label Oct 10, 2024

didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from eb59902 to d0a5cde Compare October 15, 2024 07:26

didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from d0a5cde to b0454bc Compare October 15, 2024 08:13

Bravo555 reviewed Oct 23, 2024

Bravo555 approved these changes Oct 23, 2024

Bravo555 removed their assignment Oct 23, 2024

didier-wenzek added 16 commits October 23, 2024 15:52

Attach versions to operation workflows and commands

3b78959

Signed-off-by: Didier Wenzek <[email protected]>

Move on-disk workflow representation in a sub-module

a83642f

Signed-off-by: Didier Wenzek <[email protected]>

Make pub the logic used to check the agent state dir

bc928e8

This is an intermediate step, the aim being to use the same directory to persist a copy of the workflows currently used (i.e. for which there is a running operation instance). Signed-off-by: Didier Wenzek <[email protected]>

Group workflow loading logic in struct WorkflowRepository

446950d

For this first step the behavior is unchanged: the workflows are only loaded on start Signed-off-by: Didier Wenzek <[email protected]>

Persist operation definition when a new instance is created

6cc005c

Signed-off-by: Didier Wenzek <[email protected]>

Reload workflow definitions on file change using inotify

dca2f6f

Signed-off-by: Didier Wenzek <[email protected]>

Unregister user-defined workflows which definitions are removed

30249b4

Signed-off-by: Didier Wenzek <[email protected]>

Restore builtin definition when a user defined workflow is removed

434f8db

Signed-off-by: Didier Wenzek <[email protected]>

Update capability messages on operation workflow updates

2f96370

Test reloading workflows at runtime

79fc269

Signed-off-by: Didier Wenzek <[email protected]>

A main workflow can update a sub-workflow before using it

1467f18

Signed-off-by: Didier Wenzek <[email protected]>

A new command instance must use the latest workflow version

651a0b7

Signed-off-by: Didier Wenzek <[email protected]>

Support concurrent instances with difference versions

55aef04

Signed-off-by: Didier Wenzek <[email protected]>

Remove copies of in-use workflow when no more used

d70209e

Signed-off-by: Didier Wenzek <[email protected]>

didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from 41f4208 to d70209e Compare October 23, 2024 14:02

didier-wenzek added this pull request to the merge queue Oct 23, 2024

Merged via the queue into thin-edge:main with commit 484e54f Oct 23, 2024
33 checks passed

didier-wenzek deleted the feat/load-operation-workflows-on-updates branch October 23, 2024 15:00
