Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle EFA security group eni leakage #534

Merged
merged 6 commits into from
Jan 3, 2025
Merged

handle EFA security group eni leakage #534

merged 6 commits into from
Jan 3, 2025

Conversation

wwvela
Copy link
Contributor

@wwvela wwvela commented Jan 2, 2025

Issue #, if available:

  • Fix the node group deletion failed caused by the ENI leakage for EFA security group.
  • The EFA Security group is created in the unmanaged-nodegroup-efa.yaml which is a customized security group specified in the node launch template, so it will be created and deleted during unmanaged nodegroup stack creation and deletion.
  • The reason why our old leaked eni clean up in infra.go doesn't work as expected is because we are using customized security group in node launch template instead of the default cluster security group. It will be deleted during the node group stack deletion after ASG tear down happened in here, but the old leaked eni clean up is handled during cluster cleaned up after the node group stack already teared down.

To resolve above issue, we can use the default cluster security group and update the EFA required ingress and egress in it instead of creating a new security group for EFA in node group stack. And the eni leakage handler in infra.go we already have can handle eni leakage in the default security group for us.

Test in local below are logs in reverse order:

January 03, 2025 at 00:02 (UTC-8:00)
I0103 08:02:04.602210 14 infra.go:245] deleted infrastructure stack: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 03, 2025 at 00:00 (UTC-8:00)
I0103 08:00:34.844267 14 infra.go:232] waiting for infrastructure stack to be deleted: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 03, 2025 at 00:00 (UTC-8:00)
I0103 08:00:34.708164 14 infra.go:222] deleting infrastructure stack: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.977376 14 cluster.go:164] waiting for cluster to be deleted: arn:aws:eks:eu-south-1:235947055528:cluster/kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.799182 14 infra.go:329] deleted 2 leaked ENI(s)!
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.799203 14 cluster.go:154] deleting cluster...
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.447365 14 infra.go:321] deleting leaked ENI: eni-081330c11228821c9
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.100503 14 infra.go:321] deleting leaked ENI: eni-0cec5610de4df59a5
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.959241 14 infra.go:314] waiting for 2 leaked ENI(s) to become available: [eni-0cec5610de4df59a5 eni-081330c11228821c9]
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.725848 14 node.go:593] nodegroup does not exist: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.588672 14 node.go:637] deleted unmanaged nodegroup stack: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279-unmanaged-nodegroup
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.588692 14 node.go:588] deleting nodegroup...

@wwvela wwvela requested a review from cartermckinnon January 2, 2025 05:16
Copy link
Contributor

@ndbaker1 ndbaker1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this fix 👏

@cartermckinnon
Copy link
Member

Can you add some more detail about how this leak is happening? What is creating the ENI(s) and not cleaning them up? What is special about the EFA stack vs. our existing leaked ENI cleanup?

// deleteLeakedENIs deletes Elastic Network Interfaces that may have been allocated (and left behind) by the VPC CNI.
// These leaked ENIs will prevent deletion of their associated subnets and security groups.
func (m *InfrastructureManager) deleteLeakedENIs() error {

@wwvela
Copy link
Contributor Author

wwvela commented Jan 2, 2025

Can you add some more detail about how this leak is happening? What is creating the ENI(s) and not cleaning them up? What is special about the EFA stack vs. our existing leaked ENI cleanup?

// deleteLeakedENIs deletes Elastic Network Interfaces that may have been allocated (and left behind) by the VPC CNI.
// These leaked ENIs will prevent deletion of their associated subnets and security groups.
func (m *InfrastructureManager) deleteLeakedENIs() error {

The EFA Security group is created in the unmanaged-nodegroup-efa.yaml which is a customized security group specified in the node launch template, so it will be created and deleted during unmanaged nodegroup stack creation and deletion.

The leaked eni attached to the EFA Security group comes from node group ASG instance.

NodeGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
AutoScalingGroupName: !Ref ResourceId
MixedInstancesPolicy:
LaunchTemplate:
LaunchTemplateSpecification:
LaunchTemplateId: !Ref NodeLaunchTemplate
Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
DesiredCapacity: !Ref NodeCount
MinSize: !Ref NodeCount
MaxSize: !Ref NodeCount
VPCZoneIdentifier: !Ref SubnetIds
Tags:
- Key: Name
Value: !Sub "${ClusterName}-Node"
PropagateAtLaunch: true
- Key: !Sub "kubernetes.io/cluster/${ClusterName}"
Value: owned
PropagateAtLaunch: true

The reason why our old leaked eni clean up doesn't work as expected is because we are using customized security group in node launch template instead of the default cluster security group. It need to be deleted during the node group stack deletion after ASG tear down, but the old leaked eni clean up is handled during cluster cleaned up after the node group stack already teared down.

@wwvela
Copy link
Contributor Author

wwvela commented Jan 2, 2025

The reason why we need customized security group instead of default cluster security group is because the EFA-enabled security group is required. According to this doc for EFA and MPI https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security, we need security group with special inbound and outbound rules for EFA

@wwvela
Copy link
Contributor Author

wwvela commented Jan 3, 2025

In commit 3, I use the default cluster security group for EFA by updating the EFA required ingress and egress in it instead of creating a new security group for EFA in node group stack. And the ENI leakage handler in infra.go we already have can handle eni leakage in the default security group for us.

test logs

January 03, 2025 at 00:02 (UTC-8:00)
I0103 08:02:04.602210 14 infra.go:245] deleted infrastructure stack: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 03, 2025 at 00:00 (UTC-8:00)
I0103 08:00:34.844267 14 infra.go:232] waiting for infrastructure stack to be deleted: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 03, 2025 at 00:00 (UTC-8:00)
I0103 08:00:34.708164 14 infra.go:222] deleting infrastructure stack: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.977376 14 cluster.go:164] waiting for cluster to be deleted: arn:aws:eks:eu-south-1:235947055528:cluster/kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.799182 14 infra.go:329] deleted 2 leaked ENI(s)!
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.799203 14 cluster.go:154] deleting cluster...
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.447365 14 infra.go:321] deleting leaked ENI: eni-081330c11228821c9
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:24.100503 14 infra.go:321] deleting leaked ENI: eni-0cec5610de4df59a5
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.959241 14 infra.go:314] waiting for 2 leaked ENI(s) to become available: [eni-0cec5610de4df59a5 eni-081330c11228821c9]
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.725848 14 node.go:593] nodegroup does not exist: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.588672 14 node.go:637] deleted unmanaged nodegroup stack: kubetest2-eksapi-cdd2510e-a5eb-485d-82e8-eb9a6fcee279-unmanaged-nodegroup
	
January 02, 2025 at 23:58 (UTC-8:00)
I0103 07:58:23.588692 14 node.go:588] deleting nodegroup...

@wwvela wwvela merged commit d065921 into aws:main Jan 3, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants