(Work in progess) MCO-1518: Promote MCN to GA #2171

isabella-janssen · 2025-01-27T01:47:10Z

No description provided.

openshift-ci · 2025-01-27T01:47:14Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2025-01-27T01:47:15Z

Hello @isabella-janssen! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci · 2025-01-27T01:47:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: isabella-janssen
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-01-27T13:36:01Z

@isabella-janssen: This pull request references MCO-1518 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

isabella-janssen · 2025-01-28T15:30:51Z

/test lint

yuqi-zhang

Did a first pass, some comments (mostly historical questions from alpha1 days)

machineconfiguration/v1/types_machineconfignode.go

yuqi-zhang · 2025-01-30T01:12:21Z

machineconfiguration/v1/types_machineconfignode.go

+	// The desired version represents the machine config the node will attempt to update to. This gets set before the machine config operator validates
+	// the new machine config against the current machine config.
+	// +required
+	ConfigVersion MachineConfigNodeSpecMachineConfigVersion `json:"configVersion"`


Hmm, I'm not sure I really understand why we had this originally. It looks like the status field has current and desired (rendered machineconfigs), and then this spec also has a desired field, and the difference is... the desired here is set before the node validates, and the desired in status is post-validation? Then what's the intent of the "current" in status?

Maybe we should consider consolidating these.

More information about the intended purpose of the Spec.ConfigVersion.Desired field is described in the original enhancement. I'd like to leave this field as is for now and get some feedback from more people throughout the review process.

@djoshy and I discussed this briefly as well. My instinct is that it would make more sense to have the spec reflect the desired config, and the status reflect the current config, to match the MCP/node annotations. (More concretely, drop the desired from MachineConfigNodeStatusMachineConfigVersion)

machineconfiguration/v1/types_machineconfignode.go

isabella-janssen · 2025-01-30T21:27:58Z

/test lint

isabella-janssen · 2025-02-04T19:37:46Z

/test lint

openshift-ci · 2025-02-04T19:44:46Z

@isabella-janssen: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

yuqi-zhang · 2025-02-27T22:13:02Z

machineconfiguration/v1/types_machineconfignode.go

+	// The error is an empty string if the image pull and pin is successful.
+	// +kubebuilder:validation:MaxLength=3276
+	// +optional
+	LastSeenError string `json:"lastSeenError,omitempty"`


We discussed this via slack/call, but I think what we're settling towards is having the error and the error'ed generation here.

Alternatively, we've noted that the PinnedImageSet status can be beefed up a bit more: https://github.com/openshift/api/pull/2198/files#r1972118770

Although thinking about it more now, one pinnedimageset can be used for multiple pools (since it's designed to be a list of images) so maybe we still want the granular status to be here instead cc @djoshy

yuqi-zhang · 2025-02-27T22:18:26Z

machineconfiguration/v1/types_machineconfignode.go

+	// MachineConfigNodeUpdateResumed describes a machine that has resumed normal processes
+	MachineConfigNodeUpdateResumed StateProgress = "Resumed"
+	// MachineConfigNodeUpdateCompatible the part of the preparing phase where the mco decides whether it can update
+	MachineConfigNodeUpdateCompatible StateProgress = "UpdateCompatible"


Thinking through the statuses a bit more, the compatibility check now mostly lies within the controller, so maybe this is no longer necessary

The MachineConfigNodeUpdateCompatible phase is the only child phase of the MachineConfigNodeUpdatePrepared phase. Therefore, the reported status of UpdatePrepared is always a mirror of UpdateCompatible. These phases report Unknown when the updates are not compatible and True when the updates are fine to proceed.

Compatibility check failed due to reconcilable error

Compatibility check failed due to error calculating post config change actions

Compatibility check succeeds

Since the prepared phase only encapsulates the compatibility check, there would be no loss in status update granularity, from my perspective, if the MachineConfigNodeUpdateCompatible phase is removed.

yuqi-zhang · 2025-02-27T22:19:23Z

machineconfiguration/v1/types_machineconfignode.go

+	// MachineConfigNodeUpdateRebooted describes the part of the post action phase where the node reboots itself
+	MachineConfigNodeUpdateRebooted StateProgress = "RebootedNode"
+	// MachineConfigNodeUpdateReloaded describes the part of the post action phase where the node reloads its CRIO service
+	MachineConfigNodeUpdateReloaded StateProgress = "ReloadedCRIO"


Since we allow admin defined actions, maybe we shouldn't track specifics anymore

yuqi-zhang · 2025-02-27T22:20:04Z

machineconfiguration/v1/types_machineconfignode.go

+	// MachineConfigNodeUpdateFilesAndOS describes the part of the inprogress phase where the nodes file and OS config change
+	MachineConfigNodeUpdateFilesAndOS StateProgress = "AppliedFilesAndOS"
+	// MachineConfigNodeUpdateCordoned describes the part of the completing phase where the node cordons
+	MachineConfigNodeUpdateCordoned StateProgress = "Cordoned"


The description seems to imply that these would be only flipped to true while cordon/uncordoning, which in practice should be almost instant. Maybe not worth making them individual fields?

These fields are flipped to Unknown if the node fails to cordon or uncordon. Alternatively, they flip to True once the node cordon or uncordon is complete. As with all MCN statuses, when a node is fully updated (desired config = current config), these statuses are listed as False.

Cordon failed

Uncordon failed

Cordon successfully completed

Uncordon successfully completed

How often does a node fail to cordon/uncordon? Is it helpful to have this level of granularity for understanding why a node fails to update?

Similar to the sentiment in the update compatibility comment, MachineConfigNodeUpdateUncordoned is the only child phase of MachineConfigNodeUpdateComplete, so removing the MachineConfigNodeUpdateUncordoned phase would not result in a loss in status update granularity. However, if we decide to keep the cordoned reporting it might be good to be consistent and keep uncordoned.

I'll try to continue any discussion via #2201 instead, apologies for splitting the conversation. Just to cap this point out,

How often does a node fail to cordon/uncordon? Is it helpful to have this level of granularity for understanding why a node fails to update?

This almost never happens (only if the API itself is inaccessible), and the MCO generally retries and drowns any temporary error out almost immediately. The failures we see here is (very rarely) on the opposite side, where the MCD thinks its uncordoned the node, but was not able to successfully do so, leaving the state of the node as cordoned until manual intervention. For these cases, the best we could determine was that there was some weird API behaviour.

Alternatively, if the intent of these states is to bridge the existing gap where a cordon on the node came from not the MCO, they may be worth it to keep. The idea being that if a node was manually cordoned or cordoned by another component, but the MCN's MachineConfigNodeUpdateUncordoned is true, then we know the MCC had successfully done so and the current cordon's ownership is us.

isabella-janssen added 2 commits January 23, 2025 21:26

addition of v1 mcn type & test

c80b09e

addition of generated files

c32f007

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025

openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 27, 2025

cleanup

5a27930

isabella-janssen changed the title ~~MCO-1518~~ MCO-1518: Promote MCN to GA Jan 27, 2025

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 27, 2025

update rfc1123 validation

369015f

yuqi-zhang reviewed Jan 30, 2025

View reviewed changes

update descriptions, types, and validation

5977184

isabella-janssen added 2 commits February 3, 2025 12:22

update doc comments

8bb354e

update maxitems for lists

a838189

isabella-janssen added 2 commits February 6, 2025 12:47

update descriptions

9d9b456

update PIS section of MCN object

8f08686

isabella-janssen changed the title ~~MCO-1518: Promote MCN to GA~~ (Work in progess) MCO-1518: Promote MCN to GA Feb 13, 2025

isabella-janssen mentioned this pull request Feb 26, 2025

MCO-1543: Update v1alpha1 MCN API #2201

Open

yuqi-zhang reviewed Feb 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(Work in progess) MCO-1518: Promote MCN to GA #2171

(Work in progess) MCO-1518: Promote MCN to GA #2171

isabella-janssen commented Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

openshift-ci-robot commented Jan 27, 2025 •

edited by openshift-ci bot

Loading

isabella-janssen commented Jan 28, 2025

yuqi-zhang left a comment

yuqi-zhang Jan 30, 2025

isabella-janssen Feb 6, 2025

yuqi-zhang Feb 27, 2025

isabella-janssen commented Jan 30, 2025

isabella-janssen commented Feb 4, 2025

openshift-ci bot commented Feb 4, 2025

yuqi-zhang Feb 27, 2025

yuqi-zhang Feb 27, 2025

isabella-janssen Feb 28, 2025

yuqi-zhang Feb 27, 2025

yuqi-zhang Feb 27, 2025

isabella-janssen Feb 28, 2025 •

edited

Loading

yuqi-zhang Feb 28, 2025

(Work in progess) MCO-1518: Promote MCN to GA #2171

Are you sure you want to change the base?

(Work in progess) MCO-1518: Promote MCN to GA #2171

Conversation

isabella-janssen commented Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

openshift-ci-robot commented Jan 27, 2025 • edited by openshift-ci bot Loading

isabella-janssen commented Jan 28, 2025

yuqi-zhang left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isabella-janssen commented Jan 30, 2025

isabella-janssen commented Feb 4, 2025

openshift-ci bot commented Feb 4, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isabella-janssen Feb 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

openshift-ci-robot commented Jan 27, 2025 •

edited by openshift-ci bot

Loading

isabella-janssen Feb 28, 2025 •

edited

Loading