Add reset algorithm to better handle crash loop. #160
Conversation
agent/src/agentclient.rs
Outdated
BottlerocketShadowState::ErrorReset => {
    // Spec should never be ErrorReset
    panic!("Spec state should never be ErrorReset");
}
I agree that panic! makes sense since this state indicates that the software is broken. We've been handling cases like this elsewhere by making an Assertion error variant that we return, which makes sure the failure bubbles up through the usual error handling code (like writing the termination log).
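For illustration, here is a minimal sketch of the pattern being described, assuming a snafu 0.6-style error enum; the names below are illustrative, not the actual brupop code.

use snafu::{ensure, Snafu};

#[derive(Debug, Snafu)]
pub enum Error {
    // "Should never happen" conditions are surfaced as errors instead of
    // panics, so they travel through the normal error path (and get written
    // to the termination log) before the process exits.
    #[snafu(display("Agent invariant violated: {}", message))]
    Assertion { message: String },
}

fn check_spec_state(spec_is_error_reset: bool) -> Result<(), Error> {
    // Instead of panic!(), return the Assertion variant and let the caller
    // decide how to terminate after logging.
    ensure!(
        !spec_is_error_reset,
        Assertion {
            message: "Spec state should never be ErrorReset"
        }
    );
    Ok(())
}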
I might not understand what you meant. Do you mean something similar to
let token_path = token_path.to_str().context(error::Assertion {
    message: "Token path (defined in models/agent.rs) is not valid unicode.",
})?;
But this just returns an error for the caller to handle. In this case, we want the program to panic since this should not happen. The difference would just be panicking here or the caller panicking later.
Yeah. I think the idea is to have the caller handle gracefully terminating the program; since our outer error handler writes to the termination log, the failure will be properly logged. If we let the program panic here, it will be really difficult to determine why it failed.
Do you know if there is a better way to write this?
agentclient_error::Error::Assertion { message: msg } => {
    return agentclient_error::Assertion { message: msg }.fail();
}
abe9b80 to 0220758
agent/src/agentclient.rs
Outdated
@@ -165,6 +165,8 @@ impl<T: APIServerClient> BrupopAgent<T> {
async fn gather_system_metadata(
Nit: I think this function name may be showing that it was a strange choice. Maybe something like current_shadow_status would be clearer? I don't exactly love that name either; there may be something better.
I think we need to make it clear that this method refetches the system metadata. Maybe shadow_status_with_refetch_system_metadata?
shadow_status_with_refreshed_system_metadata
I like that suggestion!
agent/src/agentclient.rs
Outdated
crash_count: u32,
state_transition_failure_timestamp: Option<DateTime<Utc>>,
Something that might help with the proliferation of these arguments would be to encapsulate the error handling state in a struct, like ShadowErrorInfo or something like that. This could be something we implement the Default trait on so that we don't need to use the constants 0 and None when constructing the Status.
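For illustration, a minimal sketch of that idea, with field names assumed from the arguments above rather than taken from the final code; deriving Default yields crash_count = 0 and state_transition_failure_timestamp = None for the fresh state.

use chrono::{DateTime, Utc};

// Encapsulates the crash-loop bookkeeping so a fresh Status can be built from
// ShadowErrorInfo::default() instead of passing the literals 0 and None around.
#[derive(Debug, Clone, Default)]
pub struct ShadowErrorInfo {
    pub crash_count: u32,
    pub state_transition_failure_timestamp: Option<DateTime<Utc>>,
}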
Sure, I can do that.
agent/src/agentclient.rs
Outdated
crash_count = 0;
state_transition_failure_timestamp = None;
I think we'll want to wait to apply these until we are transitioning from MonitoringForUpdate into Idle.
The reason is that we want to be able to add support for customers to add custom checks that the node is healthy (#12, which references the similar feature in CLUO). Basically, we want any check that fails here to result in the node being reverted to the previous version while maintaining its crash loop state.
We can implement that functionality in a future release, but we should still make sure to only assign the fresh state after we've reached Idle.
By this point the agent has already handled the state transition from MonitoringForUpdate into Idle.
The logic for the state transition is in self.handle_state_transition; this part just sets the correct status to update. What you mentioned could be implemented in self.handle_state_transition by causing an Err() return value, which then goes into the error handling part.
models/src/node/mod.rs
Outdated
/// Order BottleRocketShadow based on crash_count in status
/// to determine the priority to be handled by the controller.
/// Uninitialized status should be considered as lowest priority.
pub fn compare_status(&self, other: &Self) -> Ordering {
Naming suggestion:
- pub fn compare_status(&self, other: &Self) -> Ordering {
+ pub fn compare_crash_count(&self, other: &Self) -> Ordering {

I really like the use of cmp here, by the way :).
It's tricky to implement cmp for BottlerocketShadow itself though, which is why I added this method instead.
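As a rough sketch of the ordering being discussed, on simplified stand-ins for the real types (and assuming that a missing status sorting as Ordering::Less is what "lowest priority" means here):

use std::cmp::Ordering;

// Simplified stand-ins for BottlerocketShadow and its status type.
struct Status {
    crash_count: u32,
}

struct Shadow {
    status: Option<Status>,
}

impl Shadow {
    // Order shadows by crash_count so the controller can prioritize them;
    // a shadow whose status is not initialized yet sorts lowest.
    fn compare_crash_count(&self, other: &Self) -> Ordering {
        match (&self.status, &other.status) {
            (Some(a), Some(b)) => a.crash_count.cmp(&b.crash_count),
            (Some(_), None) => Ordering::Greater,
            (None, Some(_)) => Ordering::Less,
            (None, None) => Ordering::Equal,
        }
    }
}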
use tracing::instrument;
use tracing::{event, Level};

const RETRY_MAX_DELAY_IN_MINUTES: i64 = 24 * 60;
This should likewise be stored as a Duration to avoid having to encode the unit in the name (MINUTES).
I believe Duration in chrono does not have const constructor functions; I get the following error:

calls in constants are limited to constant functions, tuple structs and tuple variants

The Duration in tokio and std::time seems to only operate on seconds and below, and I feel that could cause more confusion. So I will keep this here for the next version; we can discuss more here.
I think we'd want to use std::time::Duration. It does seem to mostly rely on second granularity for its interface, but I think it's usually a good idea to encode a measured unit in your type rather than in your variable name.
Given the lack of a minute interface, though, I'm willing to disagree but ship it in spite of this.
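For reference, std::time::Duration::from_secs is a const fn, so a constant along these lines would also work; this is only a sketch of the suggestion, not the committed code.

use std::time::Duration;

// Encode the unit in the type instead of in the constant's name.
const RETRY_MAX_DELAY: Duration = Duration::from_secs(24 * 60 * 60);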
I think this is ready. There are a few nits, but they can be iteratively cleaned up or addressed here; I'm overall happy with the behavior.
Exponential backoff in retry, with the crash counter reset after ~24 hours, to prevent the agent from crashing infinitely. With each retry starting in the Idle state, this algorithm allows the agent to fetch the latest Bottlerocket version for each retry. By waiting for the agent to go back to Idle for recovery, this algorithm prevents the controller from browning out the cluster.
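For illustration, a minimal sketch of that behavior; the function names and the exact backoff curve below are assumptions for illustration, not the exact brupop implementation.

use chrono::{DateTime, Duration, Utc};

const RETRY_MAX_DELAY_IN_MINUTES: i64 = 24 * 60;

// If the last failure is older than the ~24 hour reset window, start counting
// crashes from zero again; otherwise keep the accumulated crash count.
fn effective_crash_count(crash_count: u32, last_failure: Option<DateTime<Utc>>) -> u32 {
    match last_failure {
        Some(ts) if Utc::now() - ts > Duration::minutes(RETRY_MAX_DELAY_IN_MINUTES) => 0,
        Some(_) => crash_count,
        None => 0,
    }
}

// Exponential backoff: wait roughly 2^crash_count seconds before retrying from
// Idle, so repeated failures slow down instead of crash-looping the agent.
fn retry_delay(crash_count: u32) -> std::time::Duration {
    std::time::Duration::from_secs(2u64.saturating_pow(crash_count))
}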
Issue number:
#123
Description of changes:
Testing done:
Without crash
Provisioned 3 old hosts with Brupop installed and monitored the cluster's update.
With crash
Intentionally made update() error out to make sure the StagedUpdate -> PerformedUpdate transition always fails, then monitored the cluster to confirm the following feature:
Not covered in test
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.