
Add reset algorithm to better handle crash loop. #160

Merged 1 commit into bottlerocket-os:develop on Mar 30, 2022

Conversation

somnusfish
Contributor

Retries use exponential backoff, with the counter reset after ~24 hours,
to prevent the agent from crashing infinitely. With each retry starting
in the Idle state, this algorithm allows the agent to fetch the latest
Bottlerocket version for each retry. By waiting for the agent to go back
to Idle for recovery, this algorithm prevents the controller from
browning out the cluster.

Issue number:
#123

Description of changes:

  • The agent shall signal that it has encountered an error while attempting a state transition, before crashing, by setting its state to ErrorReset.
  • When the controller progresses a node and detects that the agent has crashed, it shall set the spec state to Idle, record the current time as a timestamp for the Bottlerocket update cycle, increment the crash_count variable, and emit failure metrics. The controller shall wait for the agent to uncordon the node if needed and set its state to Idle before moving forward.
  • In find_and_update_ready_node, the controller shall prefer to find and update nodes with a lower crash_count.
  • When progressing a node that has crashed before, the controller shall treat it as a usual node if 2^crash_count seconds have elapsed since the first crash. Otherwise, the controller shall check whether the 24-hour reset window has been reached: if more than 24 hours have passed since the first crash, the controller shall reset crash_count to 0 and the crash timestamp to None, then treat the node as a usual node; otherwise the controller shall skip the node (see the sketch after this list).
  • When the agent restarts, if its status is ErrorReset and the spec is Idle, the agent shall uncordon the node based on the recorded previous state and set the current status to Idle; otherwise the agent shall try to proceed as usual.
  • Upon a successful Bottlerocket update cycle, the crash_count variable shall be reset to 0 and the crash timestamp shall be set to None.
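
A minimal sketch of the backoff/reset decision described above, assuming chrono timestamps and the status fields named in this PR (crash_count, state_transition_failure_timestamp); the merged controller code may structure this differently:

    use chrono::{DateTime, Duration, Utc};

    // Illustrative only: the field and function names here are assumptions based on
    // this PR's description, not the exact code that was merged.
    const RETRY_MAX_DELAY_IN_MINUTES: i64 = 24 * 60;

    /// Returns true if the controller should skip this node for now.
    fn should_skip_crashed_node(
        crash_count: u32,
        state_transition_failure_timestamp: Option<DateTime<Utc>>,
        now: DateTime<Utc>,
    ) -> bool {
        match state_transition_failure_timestamp {
            // No recorded crash: treat as a usual node.
            None => false,
            Some(first_failure) => {
                let since_failure = now - first_failure;
                // More than ~24 hours since the first crash: the caller resets
                // crash_count to 0 and the timestamp to None, then proceeds.
                if since_failure >= Duration::minutes(RETRY_MAX_DELAY_IN_MINUTES) {
                    return false;
                }
                // Exponential backoff: wait 2^crash_count seconds before retrying.
                since_failure < Duration::seconds(2_i64.saturating_pow(crash_count))
            }
        }
    }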

Testing done:

Without crash

Provisioned 3 old hosts with Brupop installed, monitored cluster's update.

With crash

Intentionally made update() return an error so that the StagedUpdate->PerformedUpdate transition always fails. Monitored the cluster to confirm the following behaviors:

  • The agent sets the status to ErrorReset when something fails
  • The controller guides the agent to recover
  • The controller tends to pick the node with the lower crash count
  • The controller retries using exponential backoff

Not covered in test

  • The controller resetting the crash count ~24 hours after the first crash.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Comment on lines 396 to 414
BottlerocketShadowState::ErrorReset => {
    // Spec should never be ErrorReset
    panic!("Spec state should never be ErrorReset");
}
Contributor

I agree that panic! makes sense since this state indicates that the software is broken. We'd been handling cases like this elsewhere by making an Assertion error variant that we return, which makes sure that the error bubbles up through the usual error handling code (like writing the termination log).

Contributor Author

I might not understand what you meant. Do you mean something similar to

    let token_path = token_path.to_str().context(error::Assertion {
        message: "Token path (defined in models/agent.rs) is not valid unicode.",
    })?;

But this just returns an error for the caller to handle. In this case, we want the program to panic since this should never happen. The difference is just whether we panic here or the caller panics later.

Contributor

Yeah. I think the idea is to have the caller handle gracefully terminating the program; since our outer error handler writes to the termination log, the failure will be properly logged. If we let the program panic here, it will be really difficult to determine why it failed.

Contributor Author

Do you know if there is a better way to write this?

agentclient_error::Error::Assertion { message: msg } => {
    return agentclient_error::Assertion { message: msg }.fail();
}

@somnusfish somnusfish force-pushed the crash_loop branch 3 times, most recently from abe9b80 to 0220758 on March 8, 2022 07:40
@somnusfish somnusfish requested a review from cbgbt March 8, 2022 08:01
@@ -165,6 +165,8 @@ impl<T: APIServerClient> BrupopAgent<T> {
async fn gather_system_metadata(
Contributor

Nit: I think this function name may be showing that it was a strange choice. Maybe something like current_shadow_status would be clearer? I don't exactly love that name either. There may be something better.

Contributor Author

I think we need to make it clear that this method refetches the system metadata. Maybe shadow_status_with_refetch_system_matadata?

Contributor Author

shadow_status_with_refreshed_system_matadata

Contributor

I like that suggestion!

Comment on lines 288 to 289
crash_count: u32,
state_transition_failure_timestamp: Option<DateTime<Utc>>,
Contributor

Something that might help with the proliferation of these arguments would be to encapsulate the error handling state in a struct, like ShadowErrorInfo or something like that. This could be something we implement the Default trait on so that we don't need to use the constants 0 and None when constructing the Status.
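
A rough sketch of what that wrapper could look like; the name ShadowErrorInfo comes from this thread, and the constructor in the trailing comment is hypothetical:

    use chrono::{DateTime, Utc};

    // Sketch of the suggested wrapper; field names follow the arguments quoted above.
    #[derive(Debug, Clone, Default)]
    pub struct ShadowErrorInfo {
        pub crash_count: u32,
        pub state_transition_failure_timestamp: Option<DateTime<Utc>>,
    }

    // With Default derived, a fresh status no longer needs the literal 0 and None,
    // e.g. (hypothetical constructor):
    // let status = Status::new(version, state, ShadowErrorInfo::default());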

Contributor Author

Sure, I can do that.

Comment on lines 449 to 450
crash_count = 0;
state_transition_failure_timestamp = None;
Contributor

I think we'll want to wait to apply these until the time that we are transitioning from MonitoringForUpdate into Idle.

The reason is that we want to be able to add support for customers adding custom checks that the node is healthy (#12, which references similar features in CLUO). Basically, we want any check that fails here to result in the node being reverted to the previous version while maintaining its crash loop state.

We can implement that functionality in a future release, but we should still make sure to only assign the fresh state after we've reached Idle.

Contributor Author

By this point the agent has already handled the state transition from MonitoringForUpdate into Idle.

The logic for the state transition is in self.handle_state_transition; this part just sets the correct status for the update. What you mentioned could happen inside self.handle_state_transition and cause an Err() return value, which goes into the error handling path.
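
For illustration, a minimal self-contained sketch of the timing discussed above (clearing the crash-loop bookkeeping only once Idle is reached again); the types here are stand-ins rather than the real brupop ones:

    // Stand-in types for illustration; the real code uses BottlerocketShadowState
    // and the status fields added in this PR.
    #[derive(Debug, PartialEq)]
    enum State {
        Idle,
        MonitoringForUpdate,
    }

    #[derive(Debug, Default)]
    struct CrashInfo {
        crash_count: u32,
        first_failure_epoch_secs: Option<u64>,
    }

    // Only reset once the node has actually reached Idle; any failure on the way
    // back (e.g. a future custom health check) keeps the crash-loop state.
    fn crash_info_after_reaching(state: State, previous: CrashInfo) -> CrashInfo {
        if state == State::Idle {
            CrashInfo::default()
        } else {
            previous
        }
    }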

/// Order BottleRocketShadow based on crash_count in status
/// to determine the priority to be handled by the controller.
/// Uninitialized status should be considered as lowest priority.
pub fn compare_status(&self, other: &Self) -> Ordering {
Contributor

Naming suggestion:

Suggested change:
- pub fn compare_status(&self, other: &Self) -> Ordering {
+ pub fn compare_crash_count(&self, other: &Self) -> Ordering {

I really like the use of cmp here, by the way :).

Contributor Author

It's tricky to implement cmp for BottlerocketShadow though, that's why I did this method.
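
For context, one way such a targeted comparison could look without implementing Ord for the whole BottlerocketShadow type, treating an uninitialized status as lowest priority; this is a sketch keyed on an optional crash_count, not the merged code:

    use std::cmp::Ordering;

    // Sketch: compare two shadows' crash counts, where None means the status has
    // not been initialized yet and should sort last (lowest priority).
    fn compare_crash_count(a: Option<u32>, b: Option<u32>) -> Ordering {
        match (a, b) {
            (Some(x), Some(y)) => x.cmp(&y),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => Ordering::Equal,
        }
    }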

use tracing::instrument;
use tracing::{event, Level};

const RETRY_MAX_DELAY_IN_MINUTES: i64 = 24 * 60;
Contributor

This should likewise be stored as a Duration to avoid having to encode the unit in the name (MINUTES).

Contributor Author

I believe chrono's Duration constructors are not const functions; I get the following error:

calls in constants are limited to constant functions, tuple structs and tuple variants

The Duration in tokio and std::time seems to operate only on seconds and below, and I feel that could cause more confusion. So I will keep this here for the next version; we can discuss more here.

Contributor

I think we'd want to use std::time::Duration. It does seem to mostly rely on second granularity for its interface, but I think it's usually a good idea to encode a measured unit in your type, rather than in your variable name.

Given the lack of a minutes interface, I'm willing to disagree but ship it in spite of this.
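
If the constant were switched to std::time::Duration, something like the following would compile, since Duration::from_secs is a const fn; the constant name here is only illustrative:

    use std::time::Duration;

    // 24 hours expressed as a typed Duration rather than a bare i64 of minutes.
    const RETRY_MAX_DELAY: Duration = Duration::from_secs(24 * 60 * 60);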

@cbgbt
Contributor

cbgbt commented Mar 30, 2022

I think this is ready. There are a few nits, but they can be iteratively cleaned up or addressed here; I'm overall happy with the behavior.

Retries use exponential backoff, with a maximum waiting time of 24 hours,
to prevent the agent from crashing infinitely. With each retry starting
in the Idle state, this algorithm allows the agent to fetch the latest
Bottlerocket version for each retry. By waiting for the agent to go back
to Idle for recovery, this algorithm prevents the controller from
browning out the cluster.
@somnusfish somnusfish merged commit 8e74b1f into bottlerocket-os:develop Mar 30, 2022