KeyVault 429 TooManyRequests led to infinite loop in reconciler workqueue #1483
Comments
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Can this be prioritized? It doesn't make sense that a request is made to the Vault for each pod; it should really be one per deployment, since all the pods for a given deployment use the same secret. With large-scale services, secret rotation becomes an unusable feature for fear of 429s from the secret provider, or of simply crashing the secret provider.
/remove-lifecycle stale
/lifecycle frozen
What steps did you take and what happened:
We have an Azure Key Vault whose average request rate is below the Key Vault throttling limit, and we recently ran into an outage when we started using the CSI Driver with auto rotation at a 3-hour interval.
A regular scale-up at peak hours triggered Key Vault throttling, and the throttling continued for hours until we disabled auto rotation.
As a result, none of the services could be created (pods were stuck in the ContainerCreating state) and we had to revert to KeyVaultAgent.
After investigation we discovered a few issues with the reconciler implementation:
The current auto rotation design is inefficient and not scalable because it rotates secrets per pod, which makes a lot of unnecessary requests. If a deployment runs 1000 replicas, each downloading 10 secrets, that is 10000 requests per rotation versus 10 if rotating per deployment.
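The amplification described above can be sketched as a quick calculation. This is a minimal illustration, not driver code; the function name and parameters are hypothetical:

```go
package main

import "fmt"

// requestsPerRotation models how many provider calls one rotation cycle
// makes. Per-pod rotation fetches every secret for every replica;
// per-deployment rotation would fetch each secret only once.
func requestsPerRotation(replicas, secretsPerPod int, perPod bool) int {
	if perPod {
		return replicas * secretsPerPod
	}
	return secretsPerPod
}

func main() {
	fmt.Println(requestsPerRotation(1000, 10, true))  // per-pod: 10000
	fmt.Println(requestsPerRotation(1000, 10, false)) // per-deployment: 10
}
```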
According to this post: Understanding how many calls are made to KeyVault?
However, the workqueue does not handle 429 responses and applies no exponential backoff.
secrets-store-csi-driver/pkg/rotation/reconciler.go
Lines 401 to 405 in bf86dbf
So each task in the queue knows nothing about 429 throttling, and the workqueue just keeps processing these requests without backoff, giving Key Vault no time to recover because the requests continue uninterrupted.
This is amplified when thousands of nodes are pulling from the same Key Vault.
https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/bf86dbf98ad3e32a0f55e52d6a411abd6784f7fb/pkg/rotation/reconciler.go#L242C1-L243C1
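One way the reconciler could treat 429s is the exponential-backoff pattern that client-go's rate-limited workqueue (e.g. `workqueue.NewItemExponentialFailureRateLimiter` plus `AddRateLimited`) provides. The sketch below is self-contained and illustrative only; the delay constants and function names are assumptions, not the driver's actual implementation:

```go
package main

import (
	"fmt"
	"net/http"
)

// Illustrative delay bounds (assumed values, not driver defaults):
// retries wait base * 2^n, capped at maxDelayMs.
const (
	baseDelayMs = 1000   // 1s initial delay
	maxDelayMs  = 300000 // 5min cap
)

// backoffDelayMs returns the exponential delay (in ms) for the given
// number of failed attempts, mirroring how an item-exponential-failure
// rate limiter grows the wait between requeues.
func backoffDelayMs(retries int) int {
	d := baseDelayMs
	for i := 0; i < retries; i++ {
		d *= 2
		if d >= maxDelayMs {
			return maxDelayMs
		}
	}
	return d
}

// shouldBackoff reports whether a provider error warrants requeueing
// with backoff instead of an immediate retry.
func shouldBackoff(statusCode int) bool {
	return statusCode == http.StatusTooManyRequests // 429
}

func main() {
	for r := 0; r <= 4; r++ {
		fmt.Printf("retry %d -> wait %dms\n", r, backoffDelayMs(r))
	}
	fmt.Println(shouldBackoff(429))
}
```

With this pattern, a task that hits a 429 is requeued with a growing delay instead of being retried immediately, which gives the Key Vault time to recover.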
What did you expect to happen:
Anything else you would like to add:
Regarding #1, the current design is not scalable.
Would love to hear from the team what the plan is for optimizing this going forward.
Also, does the polling interval start counting once all rotation requests are processed, or as soon as List SPCPC is invoked?
If it starts as soon as List SPCPC is invoked and the tasks are added to the workqueue, does that mean a new iteration adds more rotation tasks to the workqueue even though the previous iteration hasn't finished them, so the queue just keeps piling up?
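On the pile-up question: client-go's `workqueue.Type` coalesces adds, so enqueueing a key that is already pending is a no-op rather than a duplicate entry. The sketch below illustrates that coalescing behavior with a toy set-backed queue (names are hypothetical; the real workqueue also tracks in-flight "dirty" items, which is omitted here):

```go
package main

import "fmt"

// dedupQueue is a minimal sketch of workqueue-style coalescing:
// adding an already-pending key does not grow the queue.
type dedupQueue struct {
	pending map[string]bool
	order   []string
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already pending.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return // already queued: coalesce instead of piling up
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Len reports how many keys are waiting.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	// Two rotation iterations enqueue the same three pod keys.
	for i := 0; i < 2; i++ {
		for _, k := range []string{"ns/pod-a", "ns/pod-b", "ns/pod-c"} {
			q.Add(k)
		}
	}
	fmt.Println(q.Len()) // 3, not 6
}
```

So identical keys should not accumulate unboundedly across iterations, though slow processing under throttling still delays when each key is actually handled.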
Which provider are you using:
Azure Key Vault.
Environment:
kubectl version
): v1.27.9