Krane 2.1.6 unable to deploy due to latency reaching cluster #805

Closed
sjagoe opened this issue Feb 15, 2021 · 8 comments · Fixed by #806 or #813

Comments

@sjagoe commented Feb 15, 2021

Bug report

Krane 2.1.6 has an extremely short timeout for cluster resource discovery, which can make deployment impossible, particularly for people working remotely over moderate-latency connections or deploying over a long distance (i.e. with high latency) to the target cluster.

The issue was introduced by #778, which uses kubectl get ... --request-timeout=1 during resource discovery. With this timeout, deploying with krane appears to be impossible for users with higher latency. By adding about 195ms of delay to my own connection (for a total latency of 200ms to the cluster), I can reproduce the issue below. Adjust the tc delay as appropriate to reproduce.

Expected behavior:

When there is moderate latency, krane is still able to deploy

Actual behavior:

Cluster resource discovery fails with a kubectl error

------------------------------------Phase 1: Initializing deploy------------------------------------
All required parameters and files are present
Discovering resources:
The following command failed and will be retried (attempt 1/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The following command failed and will be retried (attempt 2/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The following command failed and will be retried (attempt 3/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The following command failed and will be retried (attempt 4/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The following command failed (attempt 5/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)


------------------------------------------Result: FAILURE-------------------------------------------
Error retrieving customresourcedefinition: unable to connect to the server: net/http: request canceled (client.timeout exceeded while awaiting headers)

Version(s) affected: 2.1.6

Steps to Reproduce

  1. Simulate a higher latency connection:
     $ sudo tc qdisc add dev eth0 root netem delay 200ms
  2. Deploy something using krane and observe the error (the failing discovery call can also be run on its own; see below):
     $ krane deploy my-ns my-cluster.example.com -f .
  3. Remove the added latency:
     $ sudo tc qdisc delete dev eth0 root netem
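To isolate the failure without running a full deploy, the discovery call from the log above can be issued directly (substitute your own context name). With the 200ms of added latency in place, it fails with the same error krane reports:

$ kubectl get CustomResourceDefinition --context=my-cluster.example.com --output=json --request-timeout=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)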
@dturn (Contributor) commented Feb 17, 2021

Thanks for filing a bug report. We definitely didn't expect that to happen. Do you think a 2-second timeout, instead of the 1 second we have right now (https://github.com/Shopify/krane/blob/master/lib/krane/cluster_resource_discovery.rb#L104), would be enough?
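One way affected folks could sanity-check a proposed value is to time the same discovery call with no timeout applied, for example:

$ time kubectl get CustomResourceDefinition --context=my-cluster.example.com --output=json > /dev/null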

@sjagoe (Author) commented Feb 17, 2021

My feeling is that two seconds would probably be okay. The developers on our side who have hit this don't see it all the time; it has been an intermittent problem on a slightly unstable Internet connection, so I think a small increase would go a long way toward mitigating the problem. Even if a failure still occurs, I suspect the retries would take care of it in most cases.

@opsidao commented Feb 18, 2021

This also broke deployments for us. It seems our EKS cluster takes a bit longer than 1 second to answer the discovery queries, and we had to downgrade to 2.1.5 for our deployments to work at all again.

I believe it would be nice to have a new --discovery-timeout CLI argument to optionally override this. If no one is already working on a solution, I can try implementing something myself; just let me know if you're already doing something so we avoid duplicating effort.

If adding this argument seems like too much, I would actually go with a more generous default_timeout to avoid issues in general, maybe even 4-5 seconds (though we would probably be fine for the most part with 2 seconds).
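As a sketch, the flag I have in mind (which doesn't exist in any released version; the name and value are just my suggestion) would be used like:

$ krane deploy my-ns my-cluster.example.com -f . --discovery-timeout=10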

@norman commented Feb 18, 2021

I observed the same issue while deploying to a cluster in the USA from a connection in Argentina. 2.1.6 times out, while 2.1.5 works fine.

[WARN][2021-02-18 10:59:27 -0300]       The following command failed and will be retried (attempt 1/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:27 -0300]       W0218 10:59:27.594214   76237 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARN][2021-02-18 10:59:29 -0300]       The following command failed and will be retried (attempt 2/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:29 -0300]       W0218 10:59:29.271530   76239 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[WARN][2021-02-18 10:59:32 -0300]       The following command failed and will be retried (attempt 3/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:32 -0300]       W0218 10:59:32.042666   76244 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARN][2021-02-18 10:59:36 -0300]       The following command failed and will be retried (attempt 4/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:36 -0300]       W0218 10:59:36.916133   76250 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I tried issuing the failing kubectl commands with a timeout of 2 seconds and they worked; with a 1-second timeout they consistently fail.
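For reference, this is the manual check I ran (same command as in the log, with only the timeout raised):

$ kubectl get CustomResourceDefinition --context=hni --output=json --request-timeout=2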

@sjagoe (Author) commented Feb 18, 2021

Given the further reports here, I'd agree with @opsidao that a slightly larger timeout would be better.

In the good cases, nobody will notice the higher timeout. In the really bad cases, it's the difference between working (although slowly) and not.

@timflapper (Contributor) commented

Even with a two-second timeout, we're seeing this issue consistently when connecting to clusters on the other side of the world.

@opsidao commented Mar 2, 2021

Same here. I did a quick test with one of our projects and it still needed a couple of retries, though it succeeded in the end. That was enough to make me decide to leave our version pinned to 2.1.5 to avoid any headaches.

I wonder, if the way to go is hardcoding a timeout, why not go with something obnoxious like 10 seconds?

I would rather wait ten seconds for a deployment to fail (in the discovery phase) than to have random failures because the cluster, the runner or the network are slightly overloaded at that moment.

Ideally though, in my opinion, this should be parameterizable through either a CLI argument or an environment variable. Each project's deployment performance is unpredictable, and finding a low number that satisfies all environments might prove tricky, whereas a large number should avoid issues for most people.
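As a sketch of the environment-variable route (the variable name is purely hypothetical; nothing in krane reads it today):

$ KRANE_DISCOVERY_TIMEOUT=10 krane deploy my-ns my-cluster.example.com -f .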

@dturn (Contributor) commented Mar 9, 2021

I would rather wait ten seconds for a deployment to fail (in the discovery phase) than to have random failures because the cluster, the runner or the network are slightly overloaded at that moment.

In the discovery phase that's true. But the deploy phase is made up of many kubectl calls, and it's a bad time to find out that the network is overloaded: if some mutating calls work and then you hit one that times out, you're left in an inconsistent state.

That being said, having one call with a shorter timeout than the others is a bad idea. I'll work on a PR to bump it back to the same value as the rest.
