cluster-autoscaler vs aws: do not backoff scaling when aws api is throttled #5271
Conversation
```diff
@@ -704,7 +704,8 @@ func executeScaleUp(context *context.AutoscalingContext, clusterStateRegistry *c
 	if err := info.Group.IncreaseSize(increase); err != nil {
 		context.LogRecorder.Eventf(apiv1.EventTypeWarning, "FailedToScaleUpGroup", "Scale-up failed for group %s: %v", info.Group.Id(), err)
 		aerr := errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to increase node group size: ")
 		clusterStateRegistry.RegisterFailedScaleUp(info.Group, metrics.FailedScaleUpReason(string(aerr.Type())), now)
+		backoff := (strings.Contains(aerr.Error(), "Throttling: Rate exceeded"))
```
Relying on specific wording of an error is both brittle and cloud-provider specific. I think this should either:
- Extend the `IncreaseSize` return value to include a bool indicating whether the node group should go into backoff, OR
- Embed information about backoff into the error returned from `IncreaseSize`. E.g.: introduce some new `CustomBackoffError` interface with a `ShouldBackoff() bool` func and return an error type which implements that function. Here, go into backoff for regular errors, and for errors that implement the new interface AND return true from that function.
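For concreteness, a minimal Go sketch of the second option; the names (`CustomBackoffError`, `awsThrottleError`, `shouldBackoff`) are illustrative, not part of the actual cluster-autoscaler API:

```go
package cloudprovider

import "errors"

// CustomBackoffError lets an error returned from IncreaseSize declare
// whether the node group should be placed into backoff.
type CustomBackoffError interface {
	error
	ShouldBackoff() bool
}

// awsThrottleError wraps an AWS throttling failure: the scale-up call
// failed, but the group itself is healthy, so no backoff is wanted.
type awsThrottleError struct{ err error }

func (e awsThrottleError) Error() string       { return "throttled: " + e.err.Error() }
func (e awsThrottleError) ShouldBackoff() bool { return false }

// shouldBackoff applies the rule from the comment above: back off for
// regular errors; for errors implementing CustomBackoffError, defer to
// their ShouldBackoff result.
func shouldBackoff(err error) bool {
	var cbe CustomBackoffError
	if errors.As(err, &cbe) {
		return cbe.ShouldBackoff()
	}
	return true
}
```

The caller in `executeScaleUp` would then consult `shouldBackoff(err)` before calling `RegisterFailedScaleUp`, leaving the classification of errors entirely to each cloud provider.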
I think this is a good solution because it leaves the backoff decision to the cloud provider. Maybe we should consider this and have some plan? @x13n
Not sure whether you're referring to the PR as a whole or to my comment here. I think the idea represented by the PR is right (some errors should not result in backoff), but the distinction between backoff and non-backoff cases belongs in each cloud provider separately.
/close There's been no response for almost 2 months, so closing. Please reopen if still needed.
@x13n: Closed this PR.
This is still needed, but I don't see an easy way of building this in a cloud-provider-agnostic way.
Yeah, I agree the cloud-provider-agnostic approach will be a bit harder to implement.
FYI: rebased commit, in case anyone needs this: grosser@a836c6f
Which component does this PR apply to?
cluster-autoscaler
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the AWS API is throttled, scaling requests can fail, which then marks the node group as backed off.
This results in ASGs scaling unbalanced, or even the autoscaler refusing to scale at all if throttling hits all ASGs.
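As an aside, a less brittle way to detect throttling than matching the error message text would be to inspect the AWS SDK error code. A hedged sketch using aws-sdk-go's `awserr` package; the helper name and the exact set of codes checked are assumptions, not what this PR implements:

```go
import (
	"errors"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isThrottlingError reports whether err (possibly wrapped) is an AWS
// throttling error, judged by the SDK error code rather than message text.
func isThrottlingError(err error) bool {
	var aerr awserr.Error
	if errors.As(err, &aerr) {
		switch aerr.Code() {
		case "Throttling", "ThrottlingException", "RequestLimitExceeded":
			return true
		}
	}
	return false
}
```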
Special notes for your reviewer:
Does this PR introduce a user-facing change?