bug(nvidia-container-toolkit): incorrect containerd config #2041

Closed
cartermckinnon opened this issue Nov 9, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@cartermckinnon
Member

cartermckinnon commented Nov 9, 2024

What happened:

A bug in nvidia-container-toolkit 1.17.0 results in a containerd config that omits some EKS defaults. Some symptoms we're aware of:

  • The pause container image used by containerd will be the default (registry.k8s.io/pause:3.5) instead of the regional ECR image that EKS provides. If your nodes have network access to registry.k8s.io, this will work fine. If they can't reach registry.k8s.io, they will not be able to create pods.
  • The cgroup driver used by containerd will be incorrect, resulting in your pods being placed in the wrong part of the cgroup tree. This may impact your telemetry.

The bug has been fixed in 1.17.1 and EKS will release an AMI with this version of the nvidia-container-toolkit as soon as possible.


Bug: NVIDIA/nvidia-container-toolkit@a06d838
Fix: NVIDIA/nvidia-container-toolkit@1995925

cartermckinnon added the bug label Nov 9, 2024
@InsomniaCoder

InsomniaCoder commented Nov 11, 2024

Not sure if it's related, but since we updated from AL2 v20241024 to v20241106, we found that /etc/containerd/config.toml has changed completely and no longer contains

    [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

which prevents new nodes from pulling images from the private registries we have set up.
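
For context on why the missing `config_path` matters: as far as I understand containerd's registry host configuration, the value is a colon-separated list of directories, and per-registry settings (such as a `hosts.toml` file or certificates) are looked up in a subdirectory named after the registry host. The sketch below only illustrates that lookup. The helper name `candidate_registry_dirs` and the host `registry.example.internal` are hypothetical, and this is not containerd's actual implementation.

```python
import posixpath


def candidate_registry_dirs(config_path: str, registry_host: str) -> list[str]:
    """List the per-registry directories implied by a colon-separated config_path."""
    # Skip empty entries so a trailing or doubled colon does not yield "".
    base_dirs = [entry for entry in config_path.split(":") if entry]
    return [posixpath.join(base, registry_host) for base in base_dirs]


# Hypothetical private registry host, used only for illustration.
dirs = candidate_registry_dirs(
    "/etc/containerd/certs.d:/etc/docker/certs.d",
    "registry.example.internal",
)
for path in dirs:
    print(path)
```

If the `registry` section is absent from the generated config, none of these directories are consulted, which would be consistent with private registry pulls failing on the new nodes.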

@cartermckinnon
Member Author

@InsomniaCoder that is the same bug, yes. Fixed version is going out in an AMI release now: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20241109

@mselim00
Contributor

The release containing this bug fix has been completed:
https://github.com/awslabs/amazon-eks-ami/releases/tag/v20241109
