Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster-autoscaler vs aws: do not backoff scaling when aws api is thr… #5271

Closed
wants to merge 1 commit into from

Conversation

grosser
Copy link
Contributor

@grosser grosser commented Oct 25, 2022

…ottled

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

when aws api is throttled scaling requests can fail and then mark the nodegroup as backed off
this results in asgs scaling inbalanced or even refusing to scale at all if it hits all asgs

Special notes for your reviewer:

Does this PR introduce a user-facing change?

-->

- cluser-autoscaler vs aws: do not stop scaling\ when aws is getting rate limited

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 25, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: grosser
Once this PR has been reviewed and has the lgtm label, please assign towca for approval by writing /assign @towca in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from feiskyer and x13n October 25, 2022 21:31
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 25, 2022
@@ -704,7 +704,8 @@ func executeScaleUp(context *context.AutoscalingContext, clusterStateRegistry *c
if err := info.Group.IncreaseSize(increase); err != nil {
context.LogRecorder.Eventf(apiv1.EventTypeWarning, "FailedToScaleUpGroup", "Scale-up failed for group %s: %v", info.Group.Id(), err)
aerr := errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to increase node group size: ")
clusterStateRegistry.RegisterFailedScaleUp(info.Group, metrics.FailedScaleUpReason(string(aerr.Type())), now)
backoff := (strings.Contains(aerr.Error(), "Throttling: Rate exceeded"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on specific wording of an error is both brittle and cloud-provider specific. I think this should either:

  • Extend IncreaseSize return value to include a bool indicating whether the node group should go into backoff OR
  • Embed information about backoff into error returned from IncreaseSize. E.g.: introduce some new CustomBackoffError interface with ShouldBackoff() bool func and return an error type which implements that function. Here go into backoff for regular errors and for errors that implement the new interface AND return false from that function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good solution for cloud provider because it leaves the backoff decision to cloud provider. Maybe we should consider this and have some plan? @x13n

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether you're referring to the PR as a whole or to my comment here. I think the idea represented by the PR is right (some errors should not result in backoff), but the distinction between backoff and non-backoff cases belongs in each cloud provider separately.

@gjtempleton gjtempleton added the area/provider/aws Issues or PRs related to aws provider label Dec 11, 2022
@x13n
Copy link
Member

x13n commented Dec 19, 2022

/close

There's no response for almost 2 months, so closing. Please reopen if still needed.

@k8s-ci-robot
Copy link
Contributor

@x13n: Closed this PR.

In response to this:

/close

There's no response for almost 2 months, so closing. Please reopen if still needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@grosser
Copy link
Contributor Author

grosser commented Dec 26, 2022

This is still needed, but I don't see an easy way of building this in cloud-provider-agnostic way.
It should definitely be addressed since it's a annoying bug when you have an account that is rate-limited regularly (not super common I assume).

@x13n
Copy link
Member

x13n commented Dec 27, 2022

Yeah, I agree the cloud-provider-agnostic approach will be a bit harder to implement. IncreaseSize function has to be able to return that extra bit of information. Returning something that implements error interface and additionally allows to do a bool check for whether the error is permanent or not should do the trick though.

@grosser
Copy link
Contributor Author

grosser commented Sep 7, 2023

FYI rebased commit in case anyone needs this grosser@a836c6f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants