-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(test/nvidia): Add Containerd Test to NVIDIA Suite for ECR and Systemd Validation #575
Conversation
…ferences, systemd cgroup usage, and NVIDIA runtime.
…failures more cleanly. Now the test only retrieves DaemonSet logs once on failure, preventing repeated “FAIL” messages while still exposing the misconfigurations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few small things, thanks for adding this!
spec: | ||
containers: | ||
- name: grep-containerd | ||
image: public.ecr.aws/docker/library/busybox:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to use public.ecr.aws/amazonlinux/amazonlinux
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason. Changed in latest commit.
test/cases/nvidia/containerd_test.go
Outdated
} | ||
|
||
// printDaemonSetPodLogs retrieves logs from each DS pod once (when DS readiness fails). | ||
func printDaemonSetPodLogs(ctx context.Context, cfg *envconf.Config, namespace, labelSelector string, t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem useful, should they go in pkg/e2e
for reuse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to internal/e2e
.
echo "=== content read by the container ===" | ||
cat /host-etc/containerd/config.toml | ||
|
||
grep -Eq '\.ecr\.' /host-etc/containerd/config.toml || { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be improved a bit with a trick we used in some AMI scripting: https://github.com/awslabs/amazon-eks-ami/blob/ee54f50bf1bd40be9a66145e7564a1f6f0c5c500/templates/al2/runtime/pull-sandbox-image.sh#L3
that will set the env var sandbox_image
equal to the value in the config file, and then you can make your assertions on that (instead of the entire config file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice... utilized in latest commits.
…e <(grep ...). This addresses the code review comments by providing a richer shell environment and enabling more precise checks against the ECR pause image field.
… reuse. This consolidates manifest and logging helpers under a single alias, simplifying import logic across test suites.
… PrintDaemonSetPodLogs helper is included.
containers: | ||
- name: grep-containerd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
container name could also be containerd-check
?
internal/e2e/logs.go
Outdated
var out string | ||
buf := make([]byte, 4096) | ||
for { | ||
n, readErr := stream.Read(buf) | ||
if n > 0 { | ||
out += string(buf[:n]) | ||
} | ||
if readErr == io.EOF { | ||
break | ||
} | ||
if readErr != nil { | ||
return out, fmt.Errorf("error reading logs: %w", readErr) | ||
} | ||
} | ||
return out, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you just want io.ReadAll
here: https://pkg.go.dev/io#ReadAll
internal/e2e/logs.go
Outdated
// getClientset returns a typed Kubernetes clientset from the given rest.Config. | ||
func getClientset(restConfig *rest.Config) (*kubernetes.Clientset, error) { | ||
cs, err := kubernetes.NewForConfig(restConfig) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to create clientset: %w", err) | ||
} | ||
return cs, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this function takes the same number of lines as the function itself -- you don't need it 😄
Co-authored-by: Nick Baker <[email protected]>
Issue #, if available:
Addresses the known bug in nvidia-container-toolkit
1.17.0
(bug(nvidia-container-toolkit): incorrect containerd config #2041) which can drop EKS defaults on Amazon Linux 2 (AL2) AMIs.Description of changes:
Introduces a new containerd test under
test/cases/nvidia
, verifying each node’s/etc/containerd/config.toml
includes:.ecr.
) for the sandbox image (avoid fallback toregistry.k8s.io
),nvidia-container-runtime
),systemd_cgroup = true
orSystemdCgroup = true
).Deploys a DaemonSet that mounts
/etc/containerd/config.toml
and runs inline grep checks. If any required line is missing, the pod fails fast, preventing the DaemonSet from becoming Ready.Prevents duplicated logs by only retrieving logs in the Assess phase on failure. On success, pods remain running (
tail -f /dev/null
), letting the DaemonSet achieve Ready status.Ensures a short 1-minute timeout for quick feedback. If the buggy containerd config is detected (missing ECR or systemd lines), the test times out and provides container logs pinpointing which checks failed.
Testing:
amazon-eks-gpu-node-1.31-v20241106
) lacking certain EKS defaults, where pods fail the grep checks and never become Ready, producing logs identifying the missing lines.amazon-eks-gpu-node-1.31-v20250123
) where all EKS defaults are present, leading to a Passing DS that becomes Ready within the timeout.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.