CA DRA: integrate template NodeInfos with K8s API #7799

Open
towca opened this issue Feb 3, 2025 · 0 comments

towca commented Feb 3, 2025

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

During autoscaling simulations Cluster Autoscaler has to predict what a new, empty Node from a given NodeGroup would look like if CA were to scale the NodeGroup up. This is called a template NodeInfo, and the logic for computing it is roughly the following (a sketch of this flow is shown after the list):

  • If the NodeGroup has at least 1 healthy Node, CA takes that Node as a base for the template and sanitizes it - changes the parts that are Node-specific (like Name or UID), and removes Pods that are not DaemonSet/static (because they won't be present on a new Node).
  • If the NodeGroup doesn't have any healthy Nodes, CA delegates computing the template to CloudProvider.TemplateNodeInfo(). Most CloudProvider.TemplateNodeInfo() implementations create the template in-memory from some information tracked on the CloudProvider side for the NodeGroup (e.g. a VM instance template).
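For illustration, a minimal sketch of that flow in Go, using simplified stand-in types rather than the real Cluster Autoscaler APIs (NodeInfo, NodeGroup and the sanitize helper below are placeholders):

```go
// Minimal sketch only; these types are stand-ins, not the real CA APIs.
package sketch

// NodeInfo is a simplified stand-in for CA's template NodeInfo (a Node plus
// its Pods; with DRA also ResourceSlices and possibly ResourceClaims).
type NodeInfo struct {
	NodeName string
	Pods     []Pod
}

// Pod carries only the fields this sketch filters on.
type Pod struct {
	Name             string
	OwnedByDaemonSet bool
	IsStaticPod      bool
}

// NodeGroup stands in for the cloudprovider NodeGroup abstraction.
type NodeGroup interface {
	HealthyNodeInfos() []NodeInfo
	// TemplateNodeInfo builds a template in-memory from provider-side data
	// (e.g. a VM instance template), mirroring CloudProvider.TemplateNodeInfo().
	TemplateNodeInfo() (NodeInfo, error)
}

// templateNodeInfo mirrors the two-step logic above: sanitize a real Node if
// the NodeGroup has a healthy one, otherwise fall back to the provider template.
func templateNodeInfo(ng NodeGroup) (NodeInfo, error) {
	if infos := ng.HealthyNodeInfos(); len(infos) > 0 {
		return sanitize(infos[0]), nil
	}
	return ng.TemplateNodeInfo()
}

// sanitize gives the template a fresh identity and drops Pods that wouldn't be
// present on a brand-new Node (anything that isn't a DaemonSet or static Pod).
func sanitize(ni NodeInfo) NodeInfo {
	out := NodeInfo{NodeName: ni.NodeName + "-template"}
	for _, p := range ni.Pods {
		if p.OwnedByDaemonSet || p.IsStaticPod {
			out.Pods = append(out.Pods, p)
		}
	}
	return out
}
```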

The first method is pretty reliable, but it requires having at least 1 Node kept in the NodeGroup at all times, which can be cost-prohibitive for expensive hardware. The reliability of the second method varies between CloudProvider implementations.

To support DRA, CloudProvider.TemplateNodeInfo() has to predict ResourceSlices and potentially ResourceClaims in addition to the Node and its Pods.
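For example, a DRA-aware template has to include fabricated ResourceSlices describing the devices a new Node is expected to expose. A hedged sketch, assuming the resource.k8s.io/v1beta1 types used by the MVP; the driver, pool and attribute values below are made up for illustration:

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// templateResourceSlice fabricates an in-memory ResourceSlice for a Node that
// doesn't exist yet, analogous to how TemplateNodeInfo() fabricates the Node.
// All concrete values (driver name, pool, device attributes) are illustrative.
func templateResourceSlice(templateNodeName string) *resourceapi.ResourceSlice {
	return &resourceapi.ResourceSlice{
		ObjectMeta: metav1.ObjectMeta{
			// Node-specific, so it would be re-sanitized per simulated Node.
			Name: templateNodeName + "-gpu-slice",
		},
		Spec: resourceapi.ResourceSliceSpec{
			Driver:   "gpu.example.com", // hypothetical DRA driver
			NodeName: templateNodeName,
			Pool: resourceapi.ResourcePool{
				Name:               templateNodeName,
				ResourceSliceCount: 1,
			},
			Devices: []resourceapi.Device{
				{
					Name: "gpu-0",
					Basic: &resourceapi.BasicDevice{
						Attributes: map[resourceapi.QualifiedName]resourceapi.DeviceAttribute{
							"model": {StringValue: ptr.To("example-model")},
						},
					},
				},
			},
		},
	}
}
```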

We have the following problems with the current setup:

  • Template NodeInfos have poor visibility and debuggability. CA doesn't log much about them (because of the volume), so for tricky issues the debugging snapshot has to be taken on-demand and analyzed. This can only be done by the cluster admin; a regular cluster user doesn't really have any visibility.
  • There is no standard way for a regular cluster user to influence the CloudProvider.TemplateNodeInfo() templates (e.g. if the user has a DS pod that exposes an extended resource). Some CloudProvider implementations give the cluster user some control (e.g. AWS, via ASG tags), but even though they allow configuring the same things (e.g. extended resources), they do so in provider-specific ways (e.g. ASG tags on AWS vs the KUBE_ENV variable in MIG instance templates on GCE).
  • There are more template objects to track with DRA, and the new objects can be quite complex. Creating them in-memory from scratch might become non-trivial, and could in some cases be better delegated to another component where the logic fits more naturally (e.g. the cloud provider control plane creating the NodeGroup).

Describe the solution you'd like.:

IMO we should integrate the template NodeInfo concept with the K8s API.

We could introduce a NodeTemplate/NodeGroupTemplate CRD (a rough sketch follows the list below):

  • The Spec would contain the scale-from-0 template NodeInfo (i.e. today's CloudProvider.TemplateNodeInfo()) set by the cluster admin.
  • The Spec would allow the cluster user to modify/override the scale-from-0 template.
  • The Status would contain the actual template used by Cluster Autoscaler. This could be the scale-from-0 template from the Spec, but it could be obtained differently (e.g. by sanitizing a real Node).
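To make this concrete, here is a rough, hypothetical sketch of such a CRD expressed as Go API types. Every type and field name below is illustrative only, not a settled design:

```go
// Hypothetical API sketch for the proposed CRD; nothing here is final.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeTemplate describes the template NodeInfo for one NodeGroup.
type NodeTemplate struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeTemplateSpec   `json:"spec,omitempty"`
	Status NodeTemplateStatus `json:"status,omitempty"`
}

// NodeTemplateSpec holds the scale-from-0 template (set by the cluster admin
// or a provider control plane) plus user-supplied overrides.
type NodeTemplateSpec struct {
	// Node is the fabricated Node object a new instance is expected to register as.
	Node *corev1.Node `json:"node,omitempty"`
	// Pods lists the DaemonSet/static Pods expected on a new Node.
	Pods []corev1.Pod `json:"pods,omitempty"`
	// ResourceSlices lists the DRA devices a new Node is expected to expose.
	ResourceSlices []resourceapi.ResourceSlice `json:"resourceSlices,omitempty"`
	// Overrides lets a regular cluster user adjust the template, e.g. add an
	// extended resource exposed by a DaemonSet pod.
	Overrides *NodeTemplateOverrides `json:"overrides,omitempty"`
}

// NodeTemplateOverrides is a user-facing, provider-agnostic override surface.
type NodeTemplateOverrides struct {
	ExtraCapacity corev1.ResourceList `json:"extraCapacity,omitempty"`
	ExtraLabels   map[string]string   `json:"extraLabels,omitempty"`
}

// NodeTemplateStatus reports the template actually used by Cluster Autoscaler,
// whether it came from the Spec or from sanitizing a real Node.
type NodeTemplateStatus struct {
	Node           *corev1.Node                `json:"node,omitempty"`
	Pods           []corev1.Pod                `json:"pods,omitempty"`
	ResourceSlices []resourceapi.ResourceSlice `json:"resourceSlices,omitempty"`
	// Source records how the Status template was obtained, e.g. "Spec" or "SanitizedNode".
	Source string `json:"source,omitempty"`
}
```

Whether user-facing overrides should live in the same object's Spec or in a separate object, and how the Status should report where the template came from, are exactly the kinds of details a KEP would need to settle.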

This would help with the problems listed above:

  • We'd have visibility into template NodeInfos used by CA through the Status.
  • We'd have a standard, CloudProvider-agnostic way for a regular cluster user to modify the templates - by changing the Spec.
  • Computing the scale-from-0 template could be delegated to a component other than CA - the component would just change the Spec.

There are a lot of details to be figured out, in particular how this relates to the Karpenter NodePool model. If it makes sense, we should generalize the concept to be useful for both Node Autoscalers. In any case, this seems like it would require writing a KEP.

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.

@towca towca self-assigned this Feb 3, 2025
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Feb 3, 2025
@mbrow137 mbrow137 moved this from 🆕 New to 🏗 In Progress in Dynamic Resource Allocation Feb 4, 2025
@mbrow137 mbrow137 moved this from 🏗 In Progress to 🆕 New in Dynamic Resource Allocation Feb 4, 2025
@mbrow137 mbrow137 moved this from 🆕 New to 🏗 In Progress in Dynamic Resource Allocation Feb 18, 2025