bug(nvidia-container-toolkit): incorrect `containerd` config #2041

cartermckinnon · 2024-11-09T20:09:04Z

What happened:

A bug in nvidia-container-toolkit 1.17.0 results in a containerd config that omits some EKS defaults. Some symptoms we're aware of:

- The pause container image used by containerd will be the default (registry.k8s.io/pause:3.5) instead of the regional ECR image that EKS provides. If your nodes have network access to registry.k8s.io, this will work fine. If they can't reach registry.k8s.io, they will not be able to create pods.
- The cgroup driver used by containerd will be incorrect, so your pods will be placed in the wrong part of the cgroup tree. This may impact your telemetry.

The bug has been fixed in 1.17.1 and EKS will release an AMI with this version of the nvidia-container-toolkit as soon as possible.

Bug: NVIDIA/nvidia-container-toolkit@a06d838
Fix: NVIDIA/nvidia-container-toolkit@1995925


InsomniaCoder · 2024-11-11T14:03:22Z

Not sure if this is related, but since we updated from AL2 v20241024 to v20241106, /etc/containerd/config.toml has changed completely, and it no longer contains:

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

As a result, new nodes cannot pull containers from the private registries we have set up.

cartermckinnon · 2024-11-11T17:34:27Z

@InsomniaCoder that is the same bug, yes. Fixed version is going out in an AMI release now: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20241109

mselim00 · 2024-11-11T23:56:43Z

The release for this bug fix has been completed:
https://github.com/awslabs/amazon-eks-ami/releases/tag/v20241109

cartermckinnon added the bug label Nov 9, 2024

Issacwww mentioned this issue Nov 11, 2024

bug(nvidia-container-tookit) update version for AL23 GPU in Isolated regions #2044

Merged

mselim00 closed this as completed Nov 11, 2024

cartermckinnon mentioned this issue Nov 14, 2024

bug(AL2): Updated nvidia-ctk to 1.17.0 from 1.16.2 breaks bootstrapping GPU instances #2054

Closed

mattcjo mentioned this issue Jan 31, 2025

feat(test/nvidia): Add Containerd Test to NVIDIA Suite for ECR and Systemd Validation aws/aws-k8s-tester#575

Merged
