(Work in progress) MCO-1518: Promote MCN to GA #2171
base: master
Conversation
Skipping CI for Draft Pull Request.
Hello @isabella-janssen! Some important instructions when contributing to openshift/api:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: isabella-janssen. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@isabella-janssen: This pull request references MCO-1518, which is a valid Jira issue.

Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test lint
Did a first pass, some comments (mostly historical questions from alpha1 days)
```go
// The desired version represents the machine config the node will attempt to update to. This gets set before the machine config operator validates
// the new machine config against the current machine config.
// +required
ConfigVersion MachineConfigNodeSpecMachineConfigVersion `json:"configVersion"`
```
Hmm, I'm not sure I really understand why we had this originally. It looks like the status field has current and desired (rendered machineconfigs), and then this spec also has a desired field, and the difference is... the desired here is set before the node validates, and the desired in status is post-validation? Then what's the intent of the "current" in status?
Maybe we should consider consolidating these.
More information about the intended purpose of the `Spec.ConfigVersion.Desired` field is described in the original enhancement. I'd like to leave this field as-is for now and get some feedback from more people throughout the review process.
@djoshy and I discussed this briefly as well. My instinct is that it would make more sense to have the spec reflect the desired config and the status reflect the current config, to match the MCP/node annotations. (More concretely, drop the desired field from `MachineConfigNodeStatusMachineConfigVersion`.)
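For illustration, a minimal sketch of what that consolidation could look like, assuming spec mirrors the desiredConfig node annotation and status mirrors currentConfig (the field layout here is hypothetical, not this PR's actual API):

```go
package v1 // illustrative placement

// Hypothetical consolidated shape: spec carries only the desired rendered
// config, status carries only the applied one.
type MachineConfigNodeSpecMachineConfigVersion struct {
	// desired is the name of the rendered MachineConfig the node will attempt to update to.
	// +required
	Desired string `json:"desired"`
}

// The status side would report only what is applied; the Desired field it
// carries today would be dropped, per the suggestion above.
type MachineConfigNodeStatusMachineConfigVersion struct {
	// current is the name of the rendered MachineConfig the node is currently running.
	// +required
	Current string `json:"current"`
}
```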
/test lint
/test lint
@isabella-janssen: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
// The error is an empty string if the image pull and pin is successful.
// +kubebuilder:validation:MaxLength=3276
// +optional
LastSeenError string `json:"lastSeenError,omitempty"`
```
We discussed this via slack/call, but I think what we're settling towards is having the error and the errored generation here.
Alternatively, we've noted that the PinnedImageSet status can be beefed up a bit more: https://github.com/openshift/api/pull/2198/files#r1972118770
Although thinking about it more now, one PinnedImageSet can be used for multiple pools (since it's designed to be a list of images), so maybe we still want the granular status to be here instead. cc @djoshy
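To sketch the "error and errored generation" idea concretely (the generation field below is an assumption for illustration, not part of this PR):

```go
// lastSeenError is the most recent error encountered while pulling and
// pinning images; it is an empty string if the image pull and pin is
// successful.
// +kubebuilder:validation:MaxLength=3276
// +optional
LastSeenError string `json:"lastSeenError,omitempty"`

// lastSeenErrorGeneration is a hypothetical companion field recording the
// generation of the PinnedImageSet that produced lastSeenError, so a stale
// error can be distinguished from one against the current spec.
// +optional
LastSeenErrorGeneration int64 `json:"lastSeenErrorGeneration,omitempty"`
```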
```go
// MachineConfigNodeUpdateResumed describes a machine that has resumed normal processes
MachineConfigNodeUpdateResumed StateProgress = "Resumed"
// MachineConfigNodeUpdateCompatible describes the part of the preparing phase where the MCO decides whether it can update
MachineConfigNodeUpdateCompatible StateProgress = "UpdateCompatible"
```
Thinking through the statuses a bit more, the compatibility check now mostly lies within the controller, so maybe this is no longer necessary
The `MachineConfigNodeUpdateCompatible` phase is the only child phase of the `MachineConfigNodeUpdatePrepared` phase. Therefore, the reported status of `UpdatePrepared` is always a mirror of `UpdateCompatible`. These phases report `Unknown` when the updates are not compatible and `True` when the updates are fine to proceed.
- Compatibility check failed due to reconcilable error
- Compatibility check failed due to error calculating post config change actions
- Compatibility check succeeds
Since the prepared phase only encapsulates the compatibility check, there would be no loss in status update granularity, from my perspective, if the `MachineConfigNodeUpdateCompatible` phase is removed.
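To make the mirroring concrete, here is a rough sketch of how a single compatibility result could drive both conditions; the helper and wiring are assumptions, not actual MCO code:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setCompatibilityConditions illustrates why UpdatePrepared always mirrors
// UpdateCompatible: both are derived from the same check, so dropping
// UpdateCompatible would lose no information.
func setCompatibilityConditions(conds *[]metav1.Condition, compatible bool, reason, msg string) {
	status := metav1.ConditionTrue
	if !compatible {
		// e.g. a reconcilable error, or a failure calculating post config change actions
		status = metav1.ConditionUnknown
	}
	for _, condType := range []string{"UpdatePrepared", "UpdateCompatible"} {
		meta.SetStatusCondition(conds, metav1.Condition{
			Type:    condType,
			Status:  status,
			Reason:  reason,
			Message: msg,
		})
	}
}
```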
```go
// MachineConfigNodeUpdateRebooted describes the part of the post action phase where the node reboots itself
MachineConfigNodeUpdateRebooted StateProgress = "RebootedNode"
// MachineConfigNodeUpdateReloaded describes the part of the post action phase where the node reloads its CRIO service
MachineConfigNodeUpdateReloaded StateProgress = "ReloadedCRIO"
```
Since we allow admin-defined actions, maybe we shouldn't track specifics anymore.
```go
// MachineConfigNodeUpdateFilesAndOS describes the part of the inprogress phase where the node's files and OS config change
MachineConfigNodeUpdateFilesAndOS StateProgress = "AppliedFilesAndOS"
// MachineConfigNodeUpdateCordoned describes the part of the completing phase where the node cordons
MachineConfigNodeUpdateCordoned StateProgress = "Cordoned"
```
The description seems to imply that these would only be flipped to true while cordoning/uncordoning, which in practice should be almost instant. Maybe not worth making them individual fields?
These fields are flipped to `Unknown` if the node fails to cordon or uncordon. Alternatively, they flip to `True` once the node cordon or uncordon is complete. As with all MCN statuses, when a node is fully updated (desired config = current config), these statuses are listed as `False`.

How often does a node fail to cordon/uncordon? Is it helpful to have this level of granularity for understanding why a node fails to update?

Similar to the sentiment in the update compatibility comment, `MachineConfigNodeUpdateUncordoned` is the only child phase of `MachineConfigNodeUpdateComplete`, so removing the `MachineConfigNodeUpdateUncordoned` phase would not result in a loss in status update granularity. However, if we decide to keep the cordoned reporting, it might be good to be consistent and keep uncordoned.
I'll try to continue any discussion via #2201 instead, apologies for splitting the conversation. Just to close this point out:

How often does a node fail to cordon/uncordon? Is it helpful to have this level of granularity for understanding why a node fails to update?

This almost never happens (only if the API itself is inaccessible), and the MCO generally retries and drowns out any temporary error almost immediately. The failures we do see here are (very rarely) on the opposite side, where the MCD thinks it has uncordoned the node but was not able to successfully do so, leaving the node cordoned until manual intervention. For those cases, the best we could determine was that there was some weird API behaviour.

Alternatively, if the intent of these states is to bridge the existing gap where a cordon on the node came from something other than the MCO, they may be worth keeping. The idea being that if a node was manually cordoned or cordoned by another component, but the MCN's `MachineConfigNodeUpdateUncordoned` is true, then we know the MCC had successfully uncordoned it, so the current cordon is not owned by us.
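A rough sketch of that ownership check, assuming the MCN exposes its uncordon state as a standard metav1.Condition (the condition type string and wiring are illustrative):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mcoOwnsCordon reports whether a cordon observed on the node can be
// attributed to the MCO. If the MCN says the last uncordon succeeded but
// the node is still unschedulable, the cordon came from an admin or
// another component.
func mcoOwnsCordon(node *corev1.Node, mcnConditions []metav1.Condition) bool {
	if !node.Spec.Unschedulable {
		return false // nothing is cordoned, so there is nothing to own
	}
	for _, c := range mcnConditions {
		if c.Type == "Uncordoned" && c.Status == metav1.ConditionTrue {
			// The MCO believes it already uncordoned this node, so the
			// current cordon must belong to someone else.
			return false
		}
	}
	return true
}
```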