
Add reset algorithm to better handle crash loop. #160

Merged 1 commit into bottlerocket-os:develop on Mar 30, 2022

Conversation

somnusfish
Contributor

Retries use exponential backoff, with the counter reset after ~24 hours,
to prevent the agent from crashing infinitely. With each retry starting
in the Idle state, this algorithm allows the agent to fetch the latest
Bottlerocket version for each retry. By waiting for the agent to go back
to Idle for recovery, this algorithm prevents the controller from
browning out the cluster.

Issue number:
#123

Description of changes:

  • The agent shall signal that it has encountered an error while attempting a state transition, before crashing, by setting its state to ErrorReset.
  • When the controller progresses a node and detects that the agent has crashed, it shall set the spec state to Idle, record the current time as a timestamp for the Bottlerocket update cycle, increment the crash_count variable, and emit failure metrics. The controller shall wait for the agent to uncordon the node if needed and set its state to Idle before moving forward.
  • In find_and_update_ready_node, the controller shall prefer to find and update nodes with a lower crash_count.
  • When progressing a node that has crashed before, the controller shall treat it as a usual node if 2^crash_count seconds have elapsed since the first crash. Otherwise, the controller shall check whether the 24-hour reset window has been reached: if more than 24 hours have passed since the first crash, the controller shall reset crash_count to 0 and the crash timestamp to None, then treat the node as a usual node; otherwise the controller shall skip the node (see the sketch after this list).
  • When the agent restarts, if its status is ErrorReset and the spec is Idle, the agent shall uncordon the node based on the recorded previous state and set the current status to Idle; otherwise the agent shall try to proceed as usual.
  • Upon a successful Bottlerocket update cycle, the crash_count variable shall be reset to 0 and the crash timestamp shall be set to None.
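
A minimal sketch of the backoff/reset decision described above, assuming chrono timestamps and the status fields named in this PR (crash_count, state_transition_failure_timestamp); the merged controller code may structure this differently:

    use chrono::{DateTime, Duration, Utc};

    // Illustrative only: the field and function names here are assumptions based on
    // this PR's description, not the exact code that was merged.
    const RETRY_MAX_DELAY_IN_MINUTES: i64 = 24 * 60;

    /// Returns true if the controller should skip this node for now.
    fn should_skip_crashed_node(
        crash_count: u32,
        state_transition_failure_timestamp: Option<DateTime<Utc>>,
        now: DateTime<Utc>,
    ) -> bool {
        match state_transition_failure_timestamp {
            // No recorded crash: treat as a usual node.
            None => false,
            Some(first_failure) => {
                let since_failure = now - first_failure;
                // More than ~24 hours since the first crash: the caller resets
                // crash_count to 0 and the timestamp to None, then proceeds.
                if since_failure >= Duration::minutes(RETRY_MAX_DELAY_IN_MINUTES) {
                    return false;
                }
                // Exponential backoff: wait 2^crash_count seconds before retrying.
                since_failure < Duration::seconds(2_i64.saturating_pow(crash_count))
            }
        }
    }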

Testing done:

Without crash

Provisioned 3 old hosts with Brupop installed, monitored cluster's update.

With crash

Intentionally made update() return an error so that the StagedUpdate->PerformedUpdate transition always fails. Monitored the cluster to confirm the following behaviors:

  • The agent sets the status to ErrorReset when something fails
  • The controller guides the agent to recover
  • The controller tends to pick the node with the lower crash count
  • The controller retries using exponential backoff

Not covered in test

  • The controller resetting the crash count ~24 hours after the first crash.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Comment on lines 396 to 414
BottlerocketShadowState::ErrorReset => {
    // Spec should never be ErrorReset
    panic!("Spec state should never be ErrorReset");
}
Contributor

I agree that panic! makes sense since this state indicates that the software is broken. We'd been handling cases like this elsewhere by making an Assertion error variant that we return, which makes sure that the error bubbles up through the usual error handling code (like writing the termination log).

Contributor Author

I might not understand what you meant. Do you mean something similar to

    let token_path = token_path.to_str().context(error::Assertion {
        message: "Token path (defined in models/agent.rs) is not valid unicode.",
    })?;

But this just returns an error for the caller to handle. In this case, we want the program to panic since this should never happen. The difference is just whether we panic here or the caller panics later.

Contributor

Yeah. I think the idea is to have the caller handle gracefully terminating the program; since our outer error handler writes to the termination log, the failure will be properly logged. If we let the program panic here, it will be really difficult to determine why it failed.

Contributor Author

Do you know if there is a better way to write this?

agentclient_error::Error::Assertion { message: msg } => {
    return agentclient_error::Assertion { message: msg }.fail();
}

@somnusfish somnusfish force-pushed the crash_loop branch 3 times, most recently from abe9b80 to 0220758 on March 8, 2022 07:40
@somnusfish somnusfish requested a review from cbgbt March 8, 2022 08:01
@@ -165,6 +165,8 @@ impl<T: APIServerClient> BrupopAgent<T> {
async fn gather_system_metadata(
Contributor

Nit: I think this function name may be showing that it was a strange choice. Maybe something like current_shadow_status would be clearer? I don't exactly love that name either. There may be something better.

Contributor Author

I think we need to make it clear that this method refetches the system metadata. Maybe shadow_status_with_refetch_system_matadata?

Contributor Author

shadow_status_with_refreshed_system_matadata

Contributor

I like that suggestion!

Comment on lines 288 to 289
crash_count: u32,
state_transition_failure_timestamp: Option<DateTime<Utc>>,
Contributor

Something that might help with the proliferation of these arguments would be to encapsulate the error handling state in a struct, like ShadowErrorInfo or something like that. This could be something we implement the Default trait on so that we don't need to use the constants 0 and None when constructing the Status.
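
A rough sketch of what that wrapper could look like; the name ShadowErrorInfo comes from this thread, and the constructor in the trailing comment is hypothetical:

    use chrono::{DateTime, Utc};

    // Sketch of the suggested wrapper; field names follow the arguments quoted above.
    #[derive(Debug, Clone, Default)]
    pub struct ShadowErrorInfo {
        pub crash_count: u32,
        pub state_transition_failure_timestamp: Option<DateTime<Utc>>,
    }

    // With Default derived, a fresh status no longer needs the literal 0 and None,
    // e.g. (hypothetical constructor):
    // let status = Status::new(version, state, ShadowErrorInfo::default());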

Contributor Author

Sure, I can do that.

Comment on lines 449 to 450
crash_count = 0;
state_transition_failure_timestamp = None;
Contributor

I think we'll want to wait to apply these until the time that we are transitioning from MonitoringForUpdate into Idle.

The reason is that we want to be able to add support for customers adding custom checks that the node is healthy (#12, which references similar features in CLUO). Basically, we want any check that fails here to result in the node being reverted to the previous version while maintaining its crash loop state.

We can implement that functionality in a future release, but we should still make sure to only assign the fresh state after we've reached Idle.

Contributor Author

By this point the agent has already handled the state transition from MonitoringForUpdate into Idle.

The logic for the state transition is in self.handle_state_transition; this part just sets the correct status for the update. What you mentioned could happen inside self.handle_state_transition and cause an Err() return value, which goes into the error handling path.
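
For illustration, a minimal self-contained sketch of the timing discussed above (clearing the crash-loop bookkeeping only once Idle is reached again); the types here are stand-ins rather than the real brupop ones:

    // Stand-in types for illustration; the real code uses BottlerocketShadowState
    // and the status fields added in this PR.
    #[derive(Debug, PartialEq)]
    enum State {
        Idle,
        MonitoringForUpdate,
    }

    #[derive(Debug, Default)]
    struct CrashInfo {
        crash_count: u32,
        first_failure_epoch_secs: Option<u64>,
    }

    // Only reset once the node has actually reached Idle; any failure on the way
    // back (e.g. a future custom health check) keeps the crash-loop state.
    fn crash_info_after_reaching(state: State, previous: CrashInfo) -> CrashInfo {
        if state == State::Idle {
            CrashInfo::default()
        } else {
            previous
        }
    }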

/// Order BottleRocketShadow based on crash_count in status
/// to determine the priority to be handled by the controller.
/// Uninitialized status should be considered as lowest priority.
pub fn compare_status(&self, other: &Self) -> Ordering {
Contributor

Naming suggestion:

Suggested change:
- pub fn compare_status(&self, other: &Self) -> Ordering {
+ pub fn compare_crash_count(&self, other: &Self) -> Ordering {

I really like the use of cmp here, by the way :).

Contributor Author

It's tricky to implement cmp for BottlerocketShadow though, that's why I did this method.
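
For context, one way such a targeted comparison could look without implementing Ord for the whole BottlerocketShadow type, treating an uninitialized status as lowest priority; this is a sketch keyed on an optional crash_count, not the merged code:

    use std::cmp::Ordering;

    // Sketch: compare two shadows' crash counts, where None means the status has
    // not been initialized yet and should sort last (lowest priority).
    fn compare_crash_count(a: Option<u32>, b: Option<u32>) -> Ordering {
        match (a, b) {
            (Some(x), Some(y)) => x.cmp(&y),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => Ordering::Equal,
        }
    }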

use tracing::instrument;
use tracing::{event, Level};

const RETRY_MAX_DELAY_IN_MINUTES: i64 = 24 * 60;
Contributor

This should likewise be stored as a Duration to avoid having to encode the unit in the name (MINUTES).

Contributor Author

I believe chrono's Duration constructors are not const functions; I get the following error:

calls in constants are limited to constant functions, tuple structs and tuple variants

The Duration in tokio and std::time seems to operate only on seconds and below, and I feel that could cause more confusion. So I will keep this here for the next version; we can discuss more here.

Contributor

I think we'd want to use std::time::Duration. It does seem to mostly rely on second granularity for its interface, but I think it's usually a good idea to encode a measured unit in your type, rather than in your variable name.

Given the lack of a minutes interface, I'm willing to disagree but ship it in spite of this.
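
If the constant were switched to std::time::Duration, something like the following would compile, since Duration::from_secs is a const fn; the constant name here is only illustrative:

    use std::time::Duration;

    // 24 hours expressed as a typed Duration rather than a bare i64 of minutes.
    const RETRY_MAX_DELAY: Duration = Duration::from_secs(24 * 60 * 60);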

@cbgbt
Contributor

cbgbt commented Mar 30, 2022

I think this is ready. There are a few nits, but they can be iteratively cleaned up or addressed here; I'm overall happy with the behavior.

Retries use exponential backoff, with a maximum waiting time of 24 hours,
to prevent the agent from crashing infinitely. With each retry starting
in the Idle state, this algorithm allows the agent to fetch the latest
Bottlerocket version for each retry. By waiting for the agent to go back
to Idle for recovery, this algorithm prevents the controller from
browning out the cluster.
@somnusfish somnusfish merged commit 8e74b1f into bottlerocket-os:develop Mar 30, 2022