CA DRA: integrate template NodeInfos with K8s API #7799
Labels: area/cluster-autoscaler, area/core-autoscaler, wg/device-management
Which component are you using?:
/area cluster-autoscaler
/area core-autoscaler
/wg device-management
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
During autoscaling simulations Cluster Autoscaler has to predict what a new, empty Node from a given NodeGroup would look like if CA were to scale the NodeGroup up. This is called a template NodeInfo, and the logic for computing it is roughly:

1. If the NodeGroup already has at least one Node in the cluster, base the template on one of those real Nodes.
2. Otherwise, fall back to `CloudProvider.TemplateNodeInfo()`. Most `CloudProvider.TemplateNodeInfo()` implementations create the template in-memory from some information tracked on the CloudProvider side for the NodeGroup (e.g. a VM instance template).

The first method is pretty reliable, but it requires having at least 1 Node kept in the NodeGroup at all times, which can be cost-prohibitive for expensive hardware. The reliability of the second method varies between CloudProvider implementations.
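For illustration, here is a minimal, self-contained Go sketch of that two-step logic. All type and function names below are simplified stand-ins of my own, not the actual Cluster Autoscaler types:

```go
package sketch

// NodeInfo stands in for CA's template NodeInfo: a predicted Node plus the
// Pods expected to run on it (e.g. DaemonSet Pods).
type NodeInfo struct {
	NodeName string
	Pods     []string
}

// NodeGroup stands in for the cloud-provider NodeGroup abstraction.
type NodeGroup interface {
	// Nodes returns the names of Nodes currently in the group.
	Nodes() ([]string, error)
	// TemplateNodeInfo builds a template in-memory from provider-side data
	// (e.g. a VM instance template).
	TemplateNodeInfo() (*NodeInfo, error)
}

// templateNodeInfo mirrors the rough logic described above.
func templateNodeInfo(ng NodeGroup, sanitize func(nodeName string) (*NodeInfo, error)) (*NodeInfo, error) {
	// Method 1: base the template on an existing Node from the group.
	// Reliable, but requires keeping at least 1 Node in the group at all times.
	if names, err := ng.Nodes(); err == nil && len(names) > 0 {
		return sanitize(names[0])
	}
	// Method 2: fall back to the provider-side template. Reliability varies
	// between CloudProvider implementations.
	return ng.TemplateNodeInfo()
}
```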
To support DRA, `CloudProvider.TemplateNodeInfo()` has to predict ResourceSlices and potentially ResourceClaims in addition to the Node and its Pods.
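To make the new requirement concrete, a DRA-aware template roughly has to bundle the following (a hypothetical sketch of my own; CA's actual internal types differ):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// draTemplateNodeInfo sketches everything a DRA-aware template has to predict.
type draTemplateNodeInfo struct {
	// What templates already contain today: the Node and the Pods expected
	// to run on it (e.g. DaemonSet Pods).
	Node *apiv1.Node
	Pods []*apiv1.Pod

	// New with DRA: node-local ResourceSlices describing the devices (e.g.
	// GPUs) that a new Node from the NodeGroup would expose.
	ResourceSlices []*resourceapi.ResourceSlice
	// Potentially also ResourceClaims needed by the template's Pods.
	ResourceClaims []*resourceapi.ResourceClaim
}
```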
We have the following problems with the current setup:

- The reliability of `CloudProvider.TemplateNodeInfo()` varies between CloudProvider implementations, and supporting DRA means every implementation now also has to predict ResourceSlices (and potentially ResourceClaims) on its own.
- There is no provider-agnostic way for a regular cluster user to modify the `CloudProvider.TemplateNodeInfo()` templates (e.g. if the user has a DS pod that exposes an extended resource). Some CloudProvider implementations give the cluster user some control (e.g. AWS, via ASG tags), but even though they allow configuring the same things (e.g. extended resources), they do so in provider-specific ways (e.g. ASG tags on AWS vs KUBE_ENV variable in MIG instance templates on GCE).

Describe the solution you'd like.:
IMO we should integrate the template NodeInfo concept with the K8s API.
We could introduce a `NodeTemplate`/`NodeGroupTemplate` CRD. Its objects could be populated by CA itself (e.g. from `CloudProvider.TemplateNodeInfo()`) or set by the cluster admin, which would help us with the problems described above; see the rough sketch below.
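Purely for illustration, here is one possible shape for such a CRD, expressed as Go API types. All field names, and the choice to embed ResourceSlices directly, are my assumptions; the real API would need to be designed as part of the KEP:

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeTemplate is one hypothetical shape for the proposed CRD: one object per
// NodeGroup, maintained either by CA or by the cluster admin.
type NodeTemplate struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodeTemplateSpec `json:"spec"`
}

type NodeTemplateSpec struct {
	// NodeGroupRef identifies the NodeGroup this template describes (how
	// NodeGroups should be referenced is an open question).
	NodeGroupRef string `json:"nodeGroupRef"`
	// Node is the predicted shape of a new, empty Node from the NodeGroup.
	Node apiv1.Node `json:"node"`
	// ResourceSlices describe the DRA devices a new Node would expose.
	ResourceSlices []resourceapi.ResourceSlice `json:"resourceSlices,omitempty"`
}
```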
There are a lot of details to be figured out, in particular how this relates to the Karpenter NodePool model. If it makes sense, we should generalize the concept to be useful for both Node Autoscalers. In any case, this seems like it would require writing a KEP.
Additional context.:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.