-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Krane 2.1.6 unable to deploy due to latency reaching cluster #805
Comments
Thanks for filing a bug report. We definitely didn't expect that to happen. Do you think a 2 second timeout instead of the 1 second we have right now https://github.com/Shopify/krane/blob/master/lib/krane/cluster_resource_discovery.rb#L104 would be enough? |
My feeling is two seconds would probably be okay. Our developers who have faced this don't have it all the time; it has been an intermittent problem on a sligtly unstable Internet connection, so I think a small increase would have a large effect on mitigating the probelm. I suspect in most cases even if there is a failure then, the retry may take care of it in most cases, too. |
This also broke deployments for us, seems like our EKS cluster takes a bit longer than 1 second to resolve the discovery queries and we had to downgrade to 2.1.5 for our deployments to work at all again. I believe that it would be nice to have a new If adding this argument seems like too much, I would actually go with a more generous default_timeout to avoid issues in general, maybe even 4-5 seconds (though we would probably be fine for the most part with 2 seconds). |
I observed the same issue while deploying to a cluster in the USA from a connection in Argentina. 2.1.6 times out, while 2.1.5 works fine.
I tried issuing the failing |
Given the further reports here, I'd agree with @opsidao that a slightly larger timeout would be better. In the good cases, nobody will notice the higher timeout. In the really bad cases, it's the difference between working (although slowly) and not. |
Even with two second timeout, we're seeing this issue consistently when connecting to clusters on the other side of the world. |
Same here, I did a quick test with one of our projects and it still did a couple of retries though it succedded in the end, but that was enough to make me decide to leave our version fixed to 2.1.5 to avoid any headaches. I wonder, if the way to go is hardcoding a timeout, why not go with something obnoxious like 10 seconds? I would rather wait ten seconds for a deployment to fail (in the discovery phase) than to have random failures because the cluster, the runner or the network are slightly overloaded at that moment. Ideally though, in my opinion, this should be parametrizable either through a CLI argument or through an environment variable, because each project's deploying performance is unpredictable and finding a low number that will satisfy all environments might prove tricky, but putting a large number should avoid issues for most people. |
In the discovery phase that's true. But the deploy phase is made up of many kubectl calls and its a bad time to find out that the network is overloaded because if some mutating calls work and then you hit one that times out you're going to be in an inconsistent state. That being said, having one call with a shorter tiemout than the others is a bad idea I'll work on a PR to bump it back to the same as the rest. |
Bug report
Krane 2.1.6 has an extremely short timeout for cluster resource discovery, which can result in inability to deploy, particularly for people working remotely with moderate-latency connections, or with long distance (i.e. high latency) to the cluster under control.
The issue is introduced with #778, which uses
kubectl get ... --request-timeout=1
during resource discovery. It appears to no longer be possible to deploy with krane due to this timeout if the user has higher latency. By introducing delay (of about 195ms) in to my own connection to have a total latency of 200ms to the cluster, I can reproduce the issue below. Adjusttc
delay as appropriate to reproduce.Expected behavior:
When there is moderate latency, krane is still able to deploy
Actual behavior:
Cluster resource discovery fails with kubectl error
Version(s) affected: 2.1.6
Steps to Reproduce
The text was updated successfully, but these errors were encountered: