Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(test/nvidia): Add Containerd Test to NVIDIA Suite for ECR and Systemd Validation #575

Merged
merged 11 commits into from
Feb 5, 2025

Conversation

mattcjo
Copy link
Contributor

@mattcjo mattcjo commented Jan 31, 2025

Issue #, if available:
Addresses the known bug in nvidia-container-toolkit 1.17.0 (bug(nvidia-container-toolkit): incorrect containerd config #2041) which can drop EKS defaults on Amazon Linux 2 (AL2) AMIs.

Description of changes:

  • Introduces a new containerd test under test/cases/nvidia, verifying each node’s /etc/containerd/config.toml includes:

    1. ECR references (.ecr.) for the sandbox image (avoid fallback to registry.k8s.io),
    2. The NVIDIA GPU runtime (nvidia-container-runtime),
    3. Systemd cgroup settings (systemd_cgroup = true or SystemdCgroup = true).
  • Deploys a DaemonSet that mounts /etc/containerd/config.toml and runs inline grep checks. If any required line is missing, the pod fails fast, preventing the DaemonSet from becoming Ready.

  • Prevents duplicated logs by only retrieving logs in the Assess phase on failure. On success, pods remain running (tail -f /dev/null), letting the DaemonSet achieve Ready status.

  • Ensures a short 1-minute timeout for quick feedback. If the buggy containerd config is detected (missing ECR or systemd lines), the test times out and provides container logs pinpointing which checks failed.

Testing:

  • Validated against an old AL2 AMI (amazon-eks-gpu-node-1.31-v20241106) lacking certain EKS defaults, where pods fail the grep checks and never become Ready, producing logs identifying the missing lines.
go test -v -tags=e2e . -run TestContainerdConfig
2025/01/31 05:50:43 No node type specified. Using the node type g6.2xlarge in the node groups.
=== RUN   TestContainerdConfig
=== RUN   TestContainerdConfig/containerd-config-check
2025/01/31 05:50:48 [Setup] Applying containerd-check DaemonSet manifest.
=== RUN   TestContainerdConfig/containerd-config-check/DaemonSet_becomes_ready
2025/01/31 05:50:48 [Assess] Waiting up to 1 minute for containerd-check DS to become Ready...
    containerd_test.go:57: [Assess] containerd-check DS did not become Ready: client rate limiter Wait returned an error: context deadline exceeded
    containerd_test.go:100: Pod containerd-check-h5lkl status: Running
    containerd_test.go:106: === Logs from containerd-check-h5lkl/grep-containerd ===
        === content read by the container ===
        version = 2

        [plugins]

          [plugins."io.containerd.grpc.v1.cri"]

            [plugins."io.containerd.grpc.v1.cri".containerd]
              default_runtime_name = "nvidia"

              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                  privileged_without_host_devices = false
                  runtime_engine = ""
                  runtime_root = ""
                  runtime_type = "io.containerd.runc.v2"

                  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                    BinaryName = "/usr/bin/nvidia-container-runtime"
        FAIL: no sandbox_image line found
        === debug ===
    containerd_test.go:59: containerd-check DS not Ready within 1 minute
=== NAME  TestContainerdConfig/containerd-config-check
    containerd_test.go:67: [Teardown] Removing containerd-check DS (no additional logs).
    containerd_test.go:71: [Teardown] containerd-check DS removed successfully.
--- FAIL: TestContainerdConfig (60.09s)
    --- FAIL: TestContainerdConfig/containerd-config-check (60.09s)
        --- FAIL: TestContainerdConfig/containerd-config-check/DaemonSet_becomes_ready (60.03s)
FAIL
FAIL    github.com/aws/aws-k8s-tester/test/cases/nvidia 72.213s
FAIL
  • Verified on an updated AL2 AMI (amazon-eks-gpu-node-1.31-v20250123) where all EKS defaults are present, leading to a Passing DS that becomes Ready within the timeout.
go test -v -tags=e2e . -run TestContainerdConfig
2025/01/31 05:54:04 No node type specified. Using the node type g6.2xlarge in the node groups.
=== RUN   TestContainerdConfig
=== RUN   TestContainerdConfig/containerd-config-check
2025/01/31 05:54:09 [Setup] Applying containerd-check DaemonSet manifest.
=== RUN   TestContainerdConfig/containerd-config-check/DaemonSet_becomes_ready
2025/01/31 05:54:09 [Assess] Waiting up to 1 minute for containerd-check DS to become Ready...
2025/01/31 05:54:14 [Assess] containerd-check DS is Ready.
=== NAME  TestContainerdConfig/containerd-config-check
    containerd_test.go:67: [Teardown] Removing containerd-check DS (no additional logs).
    containerd_test.go:71: [Teardown] containerd-check DS removed successfully.
--- PASS: TestContainerdConfig (5.05s)
    --- PASS: TestContainerdConfig/containerd-config-check (5.05s)
        --- PASS: TestContainerdConfig/containerd-config-check/DaemonSet_becomes_ready (5.01s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/nvidia 17.075s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…ferences, systemd cgroup usage, and NVIDIA runtime.
…failures more cleanly. Now the test only retrieves DaemonSet logs once on failure, preventing repeated “FAIL” messages while still exposing the misconfigurations.
Copy link
Member

@cartermckinnon cartermckinnon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small things, thanks for adding this!

spec:
containers:
- name: grep-containerd
image: public.ecr.aws/docker/library/busybox:latest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to use public.ecr.aws/amazonlinux/amazonlinux?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. Changed in latest commit.

}

// printDaemonSetPodLogs retrieves logs from each DS pod once (when DS readiness fails).
func printDaemonSetPodLogs(ctx context.Context, cfg *envconf.Config, namespace, labelSelector string, t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem useful, should they go in pkg/e2e for reuse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to internal/e2e.

echo "=== content read by the container ==="
cat /host-etc/containerd/config.toml

grep -Eq '\.ecr\.' /host-etc/containerd/config.toml || {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be improved a bit with a trick we used in some AMI scripting: https://github.com/awslabs/amazon-eks-ami/blob/ee54f50bf1bd40be9a66145e7564a1f6f0c5c500/templates/al2/runtime/pull-sandbox-image.sh#L3

that will set the env var sandbox_image equal to the value in the config file, and then you can make your assertions on that (instead of the entire config file)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice... utilized in latest commits.

…e <(grep ...). This addresses the code review comments by providing a richer shell environment and enabling more precise checks against the ECR pause image field.
… reuse. This consolidates manifest and logging helpers under a single alias, simplifying import logic across test suites.
Comment on lines 17 to 18
containers:
- name: grep-containerd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

container name could also be containerd-check?

Comment on lines 70 to 84
var out string
buf := make([]byte, 4096)
for {
n, readErr := stream.Read(buf)
if n > 0 {
out += string(buf[:n])
}
if readErr == io.EOF {
break
}
if readErr != nil {
return out, fmt.Errorf("error reading logs: %w", readErr)
}
}
return out, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you just want io.ReadAll here: https://pkg.go.dev/io#ReadAll

Comment on lines 87 to 94
// getClientset returns a typed Kubernetes clientset from the given rest.Config.
func getClientset(restConfig *rest.Config) (*kubernetes.Clientset, error) {
cs, err := kubernetes.NewForConfig(restConfig)
if err != nil {
return nil, fmt.Errorf("failed to create clientset: %w", err)
}
return cs, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this function takes the same number of lines as the function itself -- you don't need it 😄

@mattcjo mattcjo merged commit ed57d72 into aws:main Feb 5, 2025
9 checks passed
@mattcjo mattcjo deleted the feat/containerd-test branch February 5, 2025 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants