Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Work in progess) MCO-1518: Promote MCN to GA #2171

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

isabella-janssen
Copy link
Member

No description provided.

Copy link
Contributor

openshift-ci bot commented Jan 27, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
Copy link
Contributor

openshift-ci bot commented Jan 27, 2025

Hello @isabella-janssen! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 27, 2025
Copy link
Contributor

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: isabella-janssen
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@isabella-janssen isabella-janssen changed the title MCO-1518 MCO-1518: Promote MCN to GA Jan 27, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 27, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jan 27, 2025

@isabella-janssen: This pull request references MCO-1518 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@isabella-janssen
Copy link
Member Author

/test lint

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass, some comments (mostly historical questions from alpha1 days)

// The desired version represents the machine config the node will attempt to update to. This gets set before the machine config operator validates
// the new machine config against the current machine config.
// +required
ConfigVersion MachineConfigNodeSpecMachineConfigVersion `json:"configVersion"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure I really understand why we had this originally. It looks like the status field has current and desired (rendered machineconfigs), and then this spec also has a desired field, and the difference is... the desired here is set before the node validates, and the desired in status is post-validation? Then what's the intent of the "current" in status?

Maybe we should consider consolidating these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More information about the intended purpose of the Spec.ConfigVersion.Desired field is described in the original enhancement. I'd like to leave this field as is for now and get some feedback from more people throughout the review process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djoshy and I discussed this briefly as well. My instinct is that it would make more sense to have the spec reflect the desired config, and the status reflect the current config, to match the MCP/node annotations. (More concretely, drop the desired from MachineConfigNodeStatusMachineConfigVersion)

@isabella-janssen
Copy link
Member Author

/test lint

@isabella-janssen
Copy link
Member Author

/test lint

Copy link
Contributor

openshift-ci bot commented Feb 4, 2025

@isabella-janssen: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@isabella-janssen isabella-janssen changed the title MCO-1518: Promote MCN to GA (Work in progess) MCO-1518: Promote MCN to GA Feb 13, 2025
// The error is an empty string if the image pull and pin is successful.
// +kubebuilder:validation:MaxLength=3276
// +optional
LastSeenError string `json:"lastSeenError,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this via slack/call, but I think what we're settling towards is having the error and the error'ed generation here.

Alternatively, we've noted that the PinnedImageSet status can be beefed up a bit more: https://github.com/openshift/api/pull/2198/files#r1972118770

Although thinking about it more now, one pinnedimageset can be used for multiple pools (since it's designed to be a list of images) so maybe we still want the granular status to be here instead cc @djoshy

// MachineConfigNodeUpdateResumed describes a machine that has resumed normal processes
MachineConfigNodeUpdateResumed StateProgress = "Resumed"
// MachineConfigNodeUpdateCompatible the part of the preparing phase where the mco decides whether it can update
MachineConfigNodeUpdateCompatible StateProgress = "UpdateCompatible"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through the statuses a bit more, the compatibility check now mostly lies within the controller, so maybe this is no longer necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MachineConfigNodeUpdateCompatible phase is the only child phase of the MachineConfigNodeUpdatePrepared phase. Therefore, the reported status of UpdatePrepared is always a mirror of UpdateCompatible. These phases report Unknown when the updates are not compatible and True when the updates are fine to proceed.

Since the prepared phase only encapsulates the compatibility check, there would be no loss in status update granularity, from my perspective, if the MachineConfigNodeUpdateCompatible phase is removed.

// MachineConfigNodeUpdateRebooted describes the part of the post action phase where the node reboots itself
MachineConfigNodeUpdateRebooted StateProgress = "RebootedNode"
// MachineConfigNodeUpdateReloaded describes the part of the post action phase where the node reloads its CRIO service
MachineConfigNodeUpdateReloaded StateProgress = "ReloadedCRIO"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we allow admin defined actions, maybe we shouldn't track specifics anymore

// MachineConfigNodeUpdateFilesAndOS describes the part of the inprogress phase where the nodes file and OS config change
MachineConfigNodeUpdateFilesAndOS StateProgress = "AppliedFilesAndOS"
// MachineConfigNodeUpdateCordoned describes the part of the completing phase where the node cordons
MachineConfigNodeUpdateCordoned StateProgress = "Cordoned"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description seems to imply that these would be only flipped to true while cordon/uncordoning, which in practice should be almost instant. Maybe not worth making them individual fields?

Copy link
Member Author

@isabella-janssen isabella-janssen Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fields are flipped to Unknown if the node fails to cordon or uncordon. Alternatively, they flip to True once the node cordon or uncordon is complete. As with all MCN statuses, when a node is fully updated (desired config = current config), these statuses are listed as False.

How often does a node fail to cordon/uncordon? Is it helpful to have this level of granularity for understanding why a node fails to update?

Similar to the sentiment in the update compatibility comment, MachineConfigNodeUpdateUncordoned is the only child phase of MachineConfigNodeUpdateComplete, so removing the MachineConfigNodeUpdateUncordoned phase would not result in a loss in status update granularity. However, if we decide to keep the cordoned reporting it might be good to be consistent and keep uncordoned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to continue any discussion via #2201 instead, apologies for splitting the conversation. Just to cap this point out,

How often does a node fail to cordon/uncordon? Is it helpful to have this level of granularity for understanding why a node fails to update?

This almost never happens (only if the API itself is inaccessible), and the MCO generally retries and drowns any temporary error out almost immediately. The failures we see here is (very rarely) on the opposite side, where the MCD thinks its uncordoned the node, but was not able to successfully do so, leaving the state of the node as cordoned until manual intervention. For these cases, the best we could determine was that there was some weird API behaviour.

Alternatively, if the intent of these states is to bridge the existing gap where a cordon on the node came from not the MCO, they may be worth it to keep. The idea being that if a node was manually cordoned or cordoned by another component, but the MCN's MachineConfigNodeUpdateUncordoned is true, then we know the MCC had successfully done so and the current cordon's ownership is us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants