Solution deployment times out before converging to a final state, which prevents the actual error from showing in the instance state #498

Open
RemindD opened this issue Sep 27, 2024 · 3 comments · May be fixed by #541
Labels
bug Something isn't working P0

Comments

@RemindD
Contributor

RemindD commented Sep 27, 2024

Since #436, the K8s target provider waits for each deployment to be updated or removed. The timeout is 5 minutes per component.
Suppose an instance deploys 5 components to 2 targets. If every one of them times out, a single deployment round takes 5 × 2 × 5 = 50 minutes.
The retry mechanism in the solution Reconcile function makes this even worse.

On the other hand, the k8s controller has a fixed timeout while it waits for the summary to reach a terminated state. If the summary does not get there in time, the controller fails with a timeout error like the one below, which hides the actual error for each component.
[image: screenshot of the timeout error]

Another side effect is that an in-progress update reconcile blocks reconciliation of any further changes. This is because we lack a cancellation token to cancel an ongoing job.

We should try to avoid the k8s controller timeout by removing unnecessary retries and limiting the timeout value.

@RemindD
Contributor Author

RemindD commented Sep 27, 2024

I would treat it as a P0 bug @msftcoderdjw

@msftcoderdjw
Contributor

We need to support cancelling a deployment.

@msftcoderdjw msftcoderdjw added bug Something isn't working P0 labels Oct 10, 2024
@msftcoderdjw
Contributor

@linyguo to fix this.
