Improve events when max total nodes of the cluster is reached. #7667

Open · wants to merge 1 commit into master
11 changes: 11 additions & 0 deletions cluster-autoscaler/core/static_autoscaler.go
@@ -528,7 +528,18 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
} else if a.MaxNodesTotal > 0 && len(readyNodes) >= a.MaxNodesTotal {
scaleUpStatus.Result = status.ScaleUpLimitedByMaxNodesTotal
klog.Warningf("Max total nodes in cluster reached: %v. Current number of ready nodes: %v", a.MaxNodesTotal, len(readyNodes))
autoscalingContext.LogRecorder.Eventf(apiv1.EventTypeWarning, "MaxNodesTotalReached",

Member:
I think instead of directly generating the event here, we should make this information a part of the scale up status, so that EventingScaleUpProcessor (and possibly other status processors) can be aware of this condition.


Contributor Author:
Do you think the cluster-wide event would make sense in the EventingScaleUpProcessor as well, or only the per-pod events? Please let me know what you think about the cluster-wide event - I would personally keep it together with the log statement, but I'm happy to move it if you have a different opinion.

It is also a bit odd that we are not running scale-up at all and yet still run EventingScaleUpProcessor, and I am afraid that someone will change that at some point and the events won't be generated any more. For now, though, this processor does indeed seem to be the better place, at least for the per-pod events.


Member:
Not triggering any scale-up for a pod because CA couldn't find a way to help is something we do want communicated externally, so I wouldn't worry about the no-scale-up events not getting generated - that would just be a bug.

I'm OK with the cluster-wide event, though I expect users to be more concerned about the pods they created, and hence to be watching the events generated for those pods.

"Max total nodes in cluster reached: %v", autoscalingContext.MaxNodesTotal)
shouldScaleUp = false

noScaleUpInfoForPods := []status.NoScaleUpInfo{}
for _, pod := range unschedulablePodsToHelp {
noScaleUpInfo := status.NoScaleUpInfo{
Pod: pod,
}
noScaleUpInfoForPods = append(noScaleUpInfoForPods, noScaleUpInfo)
}
scaleUpStatus.PodsRemainUnschedulable = noScaleUpInfoForPods
} else if len(a.BypassedSchedulers) == 0 && allPodsAreNew(unschedulablePodsToHelp, currentTime) {
// The assumption here is that these pods have been created very recently and probably there
// is more pods to come. In theory we could check the newest pod time but then if pod were created
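For readers following the thread above: a minimal, self-contained sketch of the status-driven flow the reviewers are discussing, in which RunOnce only records the condition on the scale-up status and a status processor later emits the per-pod events. The types and function names below are simplified stand-ins, not the cluster-autoscaler's real structs.

// Minimal sketch of the status-driven flow discussed above: RunOnce only marks
// the ScaleUpStatus when the node limit is hit, and a status processor turns
// that into per-pod events. Types are simplified stand-ins, not the real
// cluster-autoscaler structs.
package main

import "fmt"

type ScaleUpResult int

const (
	ScaleUpSuccessful ScaleUpResult = iota
	ScaleUpLimitedByMaxNodesTotal
)

type Pod struct{ Name string }

type NoScaleUpInfo struct{ Pod *Pod }

type ScaleUpStatus struct {
	Result                  ScaleUpResult
	PodsRemainUnschedulable []NoScaleUpInfo
}

// markLimitedByMaxNodes mirrors what RunOnce does in the diff above: it records
// the result and the pods that stay pending instead of emitting per-pod events directly.
func markLimitedByMaxNodes(status *ScaleUpStatus, pending []*Pod) {
	status.Result = ScaleUpLimitedByMaxNodesTotal
	for _, pod := range pending {
		status.PodsRemainUnschedulable = append(status.PodsRemainUnschedulable, NoScaleUpInfo{Pod: pod})
	}
}

// processStatus mirrors the role of EventingScaleUpStatusProcessor: it derives
// the per-pod event message from the status, so the "max total nodes" reason is
// reported on every pod that could not be helped.
func processStatus(status *ScaleUpStatus, recordEvent func(pod *Pod, msg string)) {
	for _, info := range status.PodsRemainUnschedulable {
		reason := "no scale-up option available" // simplified placeholder for the aggregated reasons
		if status.Result == ScaleUpLimitedByMaxNodesTotal {
			reason = "max total nodes in cluster reached"
		}
		recordEvent(info.Pod, fmt.Sprintf("pod didn't trigger scale-up: %s", reason))
	}
}

func main() {
	pending := []*Pod{{Name: "web-0"}, {Name: "web-1"}}
	status := &ScaleUpStatus{}
	markLimitedByMaxNodes(status, pending)
	processStatus(status, func(pod *Pod, msg string) {
		fmt.Printf("%s: %s\n", pod.Name, msg)
	})
}

In the actual change, the cluster-wide MaxNodesTotalReached warning stays next to the log statement in RunOnce, while the per-pod NotTriggerScaleUp events come from EventingScaleUpStatusProcessor via the ScaleUpStatus, as shown in the diff below.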
@@ -41,7 +41,7 @@ func (p *EventingScaleUpStatusProcessor) Process(context *context.AutoscalingCon
for _, noScaleUpInfo := range status.PodsRemainUnschedulable {
context.Recorder.Event(noScaleUpInfo.Pod, apiv1.EventTypeNormal, "NotTriggerScaleUp",
fmt.Sprintf("pod didn't trigger scale-up: %s",
-				ReasonsMessage(noScaleUpInfo, consideredNodeGroupsMap)))
+				ReasonsMessage(status.Result, noScaleUpInfo, consideredNodeGroupsMap)))
}
} else {
klog.V(4).Infof("Skipping event processing for unschedulable pods since there is a" +
@@ -60,7 +60,11 @@ func (p *EventingScaleUpStatusProcessor) CleanUp() {
}

// ReasonsMessage aggregates reasons from NoScaleUpInfos.
- func ReasonsMessage(noScaleUpInfo NoScaleUpInfo, consideredNodeGroups map[string]cloudprovider.NodeGroup) string {
+ func ReasonsMessage(scaleUpStatus ScaleUpResult, noScaleUpInfo NoScaleUpInfo, consideredNodeGroups map[string]cloudprovider.NodeGroup) string {
if scaleUpStatus == ScaleUpLimitedByMaxNodesTotal {
return "max total nodes in cluster reached"
}

messages := []string{}
aggregated := map[string]int{}
for nodeGroupId, reasons := range noScaleUpInfo.RejectedNodeGroups {
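As a rough illustration of the two paths through the updated ReasonsMessage, here is a simplified, self-contained stand-in with hypothetical inputs; it mirrors the shape of the change above (early return for the max-nodes case, per-node-group aggregation otherwise) rather than the package's actual types.

// Simplified stand-in for the two paths through ReasonsMessage after this
// change. Types, inputs, and the join separator are illustrative only.
package main

import (
	"fmt"
	"strings"
)

type ScaleUpResult int

const (
	ScaleUpNoOptionsAvailable ScaleUpResult = iota
	ScaleUpLimitedByMaxNodesTotal
)

// reasonsMessage mimics the updated function's shape: the scale-up result is
// checked first, and only the fallback path aggregates per-node-group reasons.
func reasonsMessage(result ScaleUpResult, rejectedReasons map[string][]string) string {
	if result == ScaleUpLimitedByMaxNodesTotal {
		return "max total nodes in cluster reached"
	}
	counts := map[string]int{}
	for _, reasons := range rejectedReasons {
		for _, r := range reasons {
			counts[r]++
		}
	}
	parts := []string{}
	for reason, n := range counts {
		parts = append(parts, fmt.Sprintf("%d %s", n, reason))
	}
	return strings.Join(parts, ", ")
}

func main() {
	rejected := map[string][]string{
		"ng-1": {"max limit reached"},
		"ng-2": {"max limit reached", "not ready"},
	}
	fmt.Println(reasonsMessage(ScaleUpNoOptionsAvailable, rejected))
	// e.g. "2 max limit reached, 1 not ready" (map iteration order may vary)
	fmt.Println(reasonsMessage(ScaleUpLimitedByMaxNodesTotal, nil))
	// "max total nodes in cluster reached"
}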
@@ -101,6 +101,21 @@ func TestEventingScaleUpStatusProcessor(t *testing.T) {
expectedTriggered: 0,
expectedNoTriggered: 0,
},
{
caseName: "No scale up; max total nodes in cluster reached",
state: &ScaleUpStatus{
Result: ScaleUpLimitedByMaxNodesTotal,
ScaleUpInfos: []nodegroupset.ScaleUpInfo{{}},
PodsTriggeredScaleUp: []*apiv1.Pod{},
PodsRemainUnschedulable: []NoScaleUpInfo{
{Pod: p1},
{Pod: p2},
{Pod: p3},
},
},
expectedTriggered: 0,
expectedNoTriggered: 3,
},
}

for _, tc := range testCases {
@@ -166,9 +181,18 @@ func TestReasonsMessage(t *testing.T) {
"2 max limit reached",
"1 not ready",
}
-	result := ReasonsMessage(NoScaleUpInfo{nil, rejected, skipped}, considered)
+	result := ReasonsMessage(ScaleUpNoOptionsAvailable, NoScaleUpInfo{nil, rejected, skipped}, considered)

for _, part := range expected {
assert.Contains(t, result, part)
}
}

func TestReasonsMessageWhenScaleUpLimitedByMaxNodesTotal(t *testing.T) {
considered := map[string]cloudprovider.NodeGroup{}
noScaleUpInfo := NoScaleUpInfo{
Pod: nil,
}
result := ReasonsMessage(ScaleUpLimitedByMaxNodesTotal, noScaleUpInfo, considered)
assert.Contains(t, result, "max total nodes in cluster reached")
}