status check went boom #826
Comments
I have the same problem in an OpenShift bare metal cluster.
I have the same issue on Azure Kubernetes 1.20.7. I had a similar issue in 6.4.0, and it persists in 8.3.2. Kubernetes External Secrets version: 8.3.2
This happened to me again at 18:36. The Vault cluster was just fine and I could log in. External Secrets simply stopped polling. This is affecting production systems. Please advise.
This happened again at 10:01 am.
I'm not following; this particular log line is to be expected even in a healthy cluster or setup. The operator watches an HTTP stream, and for some reason those streams are not being closed cleanly by the Kubernetes API, so as a precaution the operator restarts the watch every 60 seconds (or whatever is configured). Otherwise the operator could hang forever, not knowing whether nothing has happened or whether no new events will ever be published on the stream. For more background see #362.
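For anyone unfamiliar with that pattern, here is a minimal TypeScript sketch of the periodic watch restart described above. The names (`watchWithPeriodicRestart`, `openStream`, `WATCH_RESTART_INTERVAL_MS`) are illustrative placeholders, not the operator's actual code; it only shows why a recurring restart is expected behaviour in a healthy setup rather than an error.

```typescript
type WatchEvent = { type: string; object: unknown };

interface WatchStream {
  close(): void;
}

// Assumed 60-second default, configurable in practice.
const WATCH_RESTART_INTERVAL_MS = 60_000;

// `openStream` stands in for whatever call opens an HTTP watch against the Kubernetes API.
function watchWithPeriodicRestart(
  openStream: (onEvent: (event: WatchEvent) => void) => WatchStream,
  onEvent: (event: WatchEvent) => void,
): void {
  const stream = openStream(onEvent);

  // The API server can stop sending events without cleanly closing the stream, and the
  // client cannot tell "nothing changed" apart from "the stream is dead". Restarting on
  // a timer bounds how long the operator can hang on a stale stream.
  setTimeout(() => {
    stream.close();
    watchWithPeriodicRestart(openStream, onEvent); // open a fresh watch and repeat
  }, WATCH_RESTART_INTERVAL_MS);
}
```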
It's hard to tell what goes wrong from this, and it's odd that there's no error being logged, since this particular log line (status check went boom) does try to log an error 😕
Yes, this is the last log line before KES simply stops upserting. There are NO errors output for me to provide you. The poller simply stops polling. This is happening DAILY at this point. We need a fix or I'm going to have to stop using this engine.
I also observed this issue in bare metal Kubernetes clusters with Vault. Some secrets are no longer updated, with no error besides "status check went boom". This also happened once with a completely new ExternalSecret, which was not updated at all (no error message in the custom resource). Kubernetes version: v1.20.11
I wish I had an easy fix 😄
Can someone provide metrics for KES around the time when it stops polling? I'm particularly interested in cpu/mem/netstat and
I couldn't find an issue using the aws provider. I let it run for 3 days, with 1/10/100/500
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days. |
This issue was closed because it has been stalled for 30 days with no activity. |
Possibly same issue as #362.
Running v8.3.0, installed via Helm chart, on GKE (1.20.8-gke.900).
The cluster gets bootstrapped by an automated pipeline, and external-secrets is expected to retrieve a single secret from Google Secret Manager. It works about 50% of the time.
When it doesn't work, I'm seeing the following lines repeated in the pod logs:
Deleting the pod solves the problem, but I have not checked whether the problem recurs.
I never saw this problem before upgrading from 6.2.0 to 8.3.0.