Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support reloading workflows at runtime #3180

Conversation

didier-wenzek
Copy link
Contributor

@didier-wenzek didier-wenzek commented Oct 9, 2024

Proposed changes

In order to support dynamic reloading of workflow, without breaking a running workflow,
the proposal is to:

  • use a hash of the workflow definition file to distinguish workflow versions
  • persist a copy of each version for a given operation
  • make the copy when a command instance is triggered
  • use reference-counting to remove copies when no more in-use.

Plan:

  • Attach a version to each workflow, this version being a hash of the definition file.
  • Attach a version to each operation instance, taking the version of the workflow definition in use.
  • Refactor the agent to group the workflow loading logic in a single place.
  • When a new operation instance is created, copy its definition in a persisted directory of workflows in-use.
  • On-start load not only the user-provided workflow definition, but also those of the workflows in-use.
  • Reload workflow definitions on file change using inotify.
  • On file change, reload the workflow definition (without erasing any definition still in-use).
  • When a user-defined workflow is removed, clear the associated capability (letting any still running operation run to completion).
  • When a builtin workflow specialization is removed, restore the builtin version if any.
  • When a workflow definition is removed, keep a copy of that definition if still in-use.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#3156

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Copy link

codecov bot commented Oct 9, 2024

Copy link
Contributor

github-actions bot commented Oct 9, 2024

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
520 0 2 520 100 1h39m10.086467s

) -> Result<(OperationWorkflow, WorkflowVersion), anyhow::Error> {
let bytes = tokio::fs::read(path).await.context("Fail to read file")?;
let input = std::str::from_utf8(&bytes).context("Fail to extract UTF8 content")?;
let version = sha256::digest(input);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using md5 should be enough here - as there is no crypto concerns.

@didier-wenzek didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from eb59902 to d0a5cde Compare October 15, 2024 07:26
@didier-wenzek didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from d0a5cde to b0454bc Compare October 15, 2024 08:13
... item="@version":"76e9afe834b4a7cadc9029670ba76745fcda73784f9e78c09f0c0416f7f58ad2"

Recover Builtin Operation
ThinEdgeIO.File Should Exist /etc/tedge/operations/software_list.toml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: the test fails if run by itself and not part of the test suite, because this file is created by the previous test case. Could we instead use Transfer to Device everywhere so that there are no dependencies between the test cases? I'd expect Transfer to Device to overwrite the file if it already exists, so it should be okay to use that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a mix feeling here. On one side, you are correct, it would be handy to have independent tests. But, on the other side, this test suite represents well a scenario where a user creates and iterate updating a workflow file.

Concretely, replacingFile Should Exist assertion by a Transfer to Device command would lead to a different test while running the suite vs the isolated test. Indeed, in the suite case, one checks that a user can update a workflow while, in the isolated case, one checks that the user can create a workflow (i.e. Update User-Defined Operation doing the same test as Create User-Defined Operation).

Copy link
Contributor

@Bravo555 Bravo555 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, but LGTM overall.

${workflow_log} Execute Command cat /var/log/tedge/agent/workflow-user-command-dyn-test-1.log
Should Contain
... ${workflow_log}
... item="@version":"37d0861e3038b34e8ab2ffe3257dd9372213ed5e17ba352e5028b0bf9762a089"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit(non-blocking): actual SHA256 is an implementation detail, we don't need to compare full value, only that it changed between different versions of the workflow

Also if the toml file changes this value will have to be updated.

It's a bit of a nitpick, but a comment would help because it's not obvious that it's SHA256 hash of user-command-v1.toml and why we're comparing it

use anyhow::Context;
use camino::Utf8Path;
use camino::Utf8PathBuf;
use log::error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use log::error;
use tracing::error;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed 41f4208

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: this module has quite a bit of functionality but doesn't have any unit tests - codecov reports 210 missed lines and 34.2% patch coverage (from other workflow-related tests)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can only acknowledge that the unit test coverage is poor. However, this code is quit extensively tested by the system test suite added by this PR (Dynamic Workflow Reloading). I opted for system-tests instead of unit-tests because the features introduced by this module are heavily related to the file system and inotify as well as sequence of user actions (adding/updating/removing files while the agent is running/restarted. One place where unit tests can be improved is the tedge_api::workflow::supervisor module which provide the in-memory representation of the uploaded workflow definitions.

@Bravo555 Bravo555 removed their assignment Oct 23, 2024
This is an intermediate step, the aim being to use the same directory to
persist a copy of the workflows currently used (i.e. for which there is
a running operation instance).

Signed-off-by: Didier Wenzek <[email protected]>
For this first step the behavior is unchanged:
the workflows are only loaded on start

Signed-off-by: Didier Wenzek <[email protected]>
…r engine

The WorkflowRepository acts as a facade to WorkflowSupervisor adding all
disk related features: loading definitions from disk, caching
definitions in-use, reloading definitions on changes.

Signed-off-by: Didier Wenzek <[email protected]>
A workflow source being always used with a complementary info:
a file path or a workflow version, it makes sense to pack the
complementary info within the WorkflowSource itself.
This also highlights the corner case of the BuiltIn workflow for which
there is no complementary info.

Signed-off-by: Didier Wenzek <[email protected]>
@didier-wenzek didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from 41f4208 to d70209e Compare October 23, 2024 14:02
@didier-wenzek didier-wenzek added this pull request to the merge queue Oct 23, 2024
Merged via the queue into thin-edge:main with commit 484e54f Oct 23, 2024
33 checks passed
@didier-wenzek didier-wenzek deleted the feat/load-operation-workflows-on-updates branch October 23, 2024 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme:workflows Theme: Workflow engine topics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants