Krane 2.1.6 unable to deploy due to latency reaching cluster #805

Closed
sjagoe opened this issue Feb 15, 2021 · 8 comments · Fixed by #806 or #813

Comments

@sjagoe commented Feb 15, 2021

Bug report

Krane 2.1.6 has an extremely short timeout for cluster resource discovery, which can make deployment impossible, particularly for people working remotely over moderate-latency connections or deploying over a long distance (i.e. with high latency) to the target cluster.

The issue was introduced by #778, which uses kubectl get ... --request-timeout=1 during resource discovery. With this timeout, deploying with krane appears to be impossible for users with higher latency. By adding about 195ms of delay to my own connection (for a total latency of 200ms to the cluster), I can reproduce the issue below. Adjust the tc delay as appropriate to reproduce.

Expected behavior:

When there is moderate latency, krane is still able to deploy

Actual behavior:

Cluster resource discovery fails with a kubectl error

------------------------------------Phase 1: Initializing deploy------------------------------------
All required parameters and files are present
Discovering resources:
The following command failed and will be retried (attempt 1/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The following command failed and will be retried (attempt 2/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The following command failed and will be retried (attempt 3/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The following command failed and will be retried (attempt 4/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The following command failed (attempt 5/5): kubectl get CustomResourceDefinition --context\=my-cluster.example.com --output\=json --request-timeout\=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)


------------------------------------------Result: FAILURE-------------------------------------------
Error retrieving customresourcedefinition: unable to connect to the server: net/http: request canceled (client.timeout exceeded while awaiting headers)

Version(s) affected: 2.1.6

Steps to Reproduce

  1. Simulate a higher latency connection:
     $ sudo tc qdisc add dev eth0 root netem delay 200ms
  2. Deploy something using krane and observe the error (the failing discovery call can also be run on its own; see below):
     $ krane deploy my-ns my-cluster.example.com -f .
  3. Remove the added latency:
     $ sudo tc qdisc delete dev eth0 root netem
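To isolate the failure without running a full deploy, the discovery call from the log above can be issued directly (substitute your own context name). With the 200ms of added latency in place, it fails with the same error krane reports:

$ kubectl get CustomResourceDefinition --context=my-cluster.example.com --output=json --request-timeout=1
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)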
@dturn (Contributor) commented Feb 17, 2021

Thanks for filing a bug report. We definitely didn't expect that to happen. Do you think a 2-second timeout, instead of the 1 second we have right now (https://github.com/Shopify/krane/blob/master/lib/krane/cluster_resource_discovery.rb#L104), would be enough?
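One way affected folks could sanity-check a proposed value is to time the same discovery call with no timeout applied, for example:

$ time kubectl get CustomResourceDefinition --context=my-cluster.example.com --output=json > /dev/null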

@sjagoe (Author) commented Feb 17, 2021

My feeling is that two seconds would probably be okay. The developers on our side who have hit this don't see it all the time; it has been an intermittent problem on a slightly unstable Internet connection, so I think a small increase would go a long way toward mitigating the problem. Even if a failure still occurs, I suspect the retries would take care of it in most cases.

@opsidao commented Feb 18, 2021

This also broke deployments for us. It seems our EKS cluster takes a bit longer than 1 second to answer the discovery queries, and we had to downgrade to 2.1.5 for our deployments to work at all again.

I believe it would be nice to have a new --discovery-timeout CLI argument to optionally override this. If no one is already working on a solution, I can try implementing something myself; just let me know if you're already doing something so we avoid duplicating effort.

If adding this argument seems like too much, I would actually go with a more generous default_timeout to avoid issues in general, maybe even 4-5 seconds (though we would probably be fine for the most part with 2 seconds).
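As a sketch, the flag I have in mind (which doesn't exist in any released version; the name and value are just my suggestion) would be used like:

$ krane deploy my-ns my-cluster.example.com -f . --discovery-timeout=10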

@norman commented Feb 18, 2021

I observed the same issue while deploying to a cluster in the USA from a connection in Argentina. 2.1.6 times out, while 2.1.5 works fine.

[WARN][2021-02-18 10:59:27 -0300]       The following command failed and will be retried (attempt 1/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:27 -0300]       W0218 10:59:27.594214   76237 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARN][2021-02-18 10:59:29 -0300]       The following command failed and will be retried (attempt 2/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:29 -0300]       W0218 10:59:29.271530   76239 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[WARN][2021-02-18 10:59:32 -0300]       The following command failed and will be retried (attempt 3/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:32 -0300]       W0218 10:59:32.042666   76244 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[WARN][2021-02-18 10:59:36 -0300]       The following command failed and will be retried (attempt 4/5): kubectl get CustomResourceDefinition --context\=hni --output\=json --request-timeout\=1
[WARN][2021-02-18 10:59:36 -0300]       W0218 10:59:36.916133   76250 transport.go:260] Unable to cancel request for *exec.roundTripper
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I tried issuing the failing kubectl commands with a timeout of 2 seconds and they worked; with a 1-second timeout they consistently fail.
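For reference, this is the manual check I ran (same command as in the log, with only the timeout raised):

$ kubectl get CustomResourceDefinition --context=hni --output=json --request-timeout=2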

@sjagoe (Author) commented Feb 18, 2021

Given the further reports here, I'd agree with @opsidao that a slightly larger timeout would be better.

In the good cases, nobody will notice the higher timeout. In the really bad cases, it's the difference between working (although slowly) and not.

@timflapper (Contributor) commented

Even with a two-second timeout, we're seeing this issue consistently when connecting to clusters on the other side of the world.

@opsidao commented Mar 2, 2021

Same here. I did a quick test with one of our projects and it still needed a couple of retries, though it succeeded in the end. That was enough to make me decide to leave our version pinned to 2.1.5 to avoid any headaches.

I wonder, if the way to go is hardcoding a timeout, why not go with something obnoxious like 10 seconds?

I would rather wait ten seconds for a deployment to fail (in the discovery phase) than to have random failures because the cluster, the runner or the network are slightly overloaded at that moment.

Ideally though, in my opinion, this should be parameterizable through either a CLI argument or an environment variable. Each project's deployment performance is unpredictable, and finding a low number that satisfies all environments might prove tricky, whereas a large number should avoid issues for most people.
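As a sketch of the environment-variable route (the variable name is purely hypothetical; nothing in krane reads it today):

$ KRANE_DISCOVERY_TIMEOUT=10 krane deploy my-ns my-cluster.example.com -f .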

@dturn (Contributor) commented Mar 9, 2021

I would rather wait ten seconds for a deployment to fail (in the discovery phase) than to have random failures because the cluster, the runner or the network are slightly overloaded at that moment.

In the discovery phase that's true. But the deploy phase is made up of many kubectl calls, and it's a bad time to find out that the network is overloaded: if some mutating calls work and then you hit one that times out, you're left in an inconsistent state.

That being said, having one call with a shorter timeout than the others is a bad idea. I'll work on a PR to bump it back to the same value as the rest.
