cluster-autoscaler vs aws: do not backoff scaling when aws api is throttled #5271

grosser · 2022-10-25T21:31:08Z

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the AWS API is throttled, scaling requests can fail, and the failure marks the node group as backed off.
This results in ASGs scaling unevenly, or refusing to scale at all if the throttling hits every ASG.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

- cluster-autoscaler vs aws: do not stop scaling when aws is getting rate limited

k8s-ci-robot · 2022-10-25T21:31:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: grosser
Once this PR has been reviewed and has the lgtm label, please assign towca for approval by writing /assign @towca in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

x13n · 2022-10-27T14:03:28Z

cluster-autoscaler/core/scale_up.go

@@ -704,7 +704,8 @@ func executeScaleUp(context *context.AutoscalingContext, clusterStateRegistry *c
 	if err := info.Group.IncreaseSize(increase); err != nil {
 		context.LogRecorder.Eventf(apiv1.EventTypeWarning, "FailedToScaleUpGroup", "Scale-up failed for group %s: %v", info.Group.Id(), err)
 		aerr := errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to increase node group size: ")
-		clusterStateRegistry.RegisterFailedScaleUp(info.Group, metrics.FailedScaleUpReason(string(aerr.Type())), now)
+		backoff := (strings.Contains(aerr.Error(), "Throttling: Rate exceeded"))


Relying on the specific wording of an error is both brittle and cloud-provider specific. I think this should either:

- Extend the IncreaseSize return value to include a bool indicating whether the node group should go into backoff, OR
- Embed information about backoff into the error returned from IncreaseSize. E.g. introduce a new CustomBackoffError interface with a ShouldBackoff() bool func and return an error type which implements it. The caller would then go into backoff for regular errors and for errors that implement the new interface and return true from that function, and skip backoff only when it returns false.

qianlei90 · 2022-12-19

I think this is a good solution for cloud providers because it leaves the backoff decision to the cloud provider. Maybe we should consider this and make a plan? @x13n

x13n · 2022-12-19

Not sure whether you're referring to the PR as a whole or to my comment here. I think the idea represented by the PR is right (some errors should not result in backoff), but the distinction between backoff and non-backoff cases belongs in each cloud provider separately.

x13n · 2022-12-19T09:15:02Z

/close

There's no response for almost 2 months, so closing. Please reopen if still needed.

k8s-ci-robot · 2022-12-19T09:15:08Z

@x13n: Closed this PR.

In response to this:

/close

There's no response for almost 2 months, so closing. Please reopen if still needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

grosser · 2022-12-26T20:45:26Z

This is still needed, but I don't see an easy way of building this in a cloud-provider-agnostic way.
It should definitely be addressed, since it's an annoying bug when you have an account that is rate-limited regularly (not super common, I assume).

x13n · 2022-12-27T14:45:16Z

Yeah, I agree the cloud-provider-agnostic approach will be a bit harder to implement. The IncreaseSize function has to be able to return that extra bit of information. Returning something that implements the error interface and additionally allows a bool check for whether the error is permanent or not should do the trick though.

grosser · 2023-09-07T00:30:22Z

FYI: rebased commit, in case anyone needs this: grosser@a836c6f

cluster-autoscaler vs aws: do not backoff scaling when aws api is throttled

eb273c5

k8s-ci-robot added the kind/bug (categorizes issue or PR as related to a bug) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Oct 25, 2022

k8s-ci-robot requested review from feiskyer and x13n October 25, 2022 21:31

k8s-ci-robot added the size/S (denotes a PR that changes 10-29 lines, ignoring generated files) label Oct 25, 2022

jbartosik added the area/cluster-autoscaler label Oct 27, 2022

x13n reviewed Oct 27, 2022

gjtempleton added the area/provider/aws (issues or PRs related to aws provider) label Dec 11, 2022

k8s-ci-robot closed this Dec 19, 2022

Chase-Marino mentioned this pull request Jan 28, 2025

feat: ca: do not backoff scale up on specified errors #7777

Open
