[elasticsearch] readiness probe failing at startup when cluster status is yellow #215
I reduced the readiness probe to just the http check, and it works:
So something must have been wrong with the other part, namely the cluster health check that waits on cluster status.
Would it be possible to make the readiness probe simpler, something like this?
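For reference, an HTTP-only probe along those lines might look like the sketch below. This is not the exact snippet linked in the comment above; the port, path and timings are assumptions.

```yaml
# Hypothetical HTTP-only readiness probe: the pod is considered ready as soon
# as Elasticsearch answers on its HTTP port, regardless of cluster status.
readinessProbe:
  httpGet:
    path: /
    port: 9200
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```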
It seems to work now, not sure how... If all continues well, I will close this by end of day tomorrow. But I'm still curious about the multi-line bash script health check when we could have something like the above link?
OK, I deleted a pod to test, and it is not starting up with a ready status. The issue is that if the cluster state is yellow, the pod never becomes ready; but if it is already ready and the cluster status then goes yellow, it remains ready. IMHO this is not ideal: either it should be ready regardless of yellow, or it should be not ready whenever the cluster is yellow (I much prefer the first of the two options), be it at startup or during runtime.
After a lot of messing around, I've come to see that changing the cluster health check setting (so the probe no longer waits on cluster status) gets the pods to become ready.
Waiting for the cluster to become green is very important. It's needed to make sure that Elasticsearch will remain available during upgrades to the statefulset and to the Kubernetes nodes themselves. This is combined with the default pod disruption budget to make sure that only 1 pod is ever unavailable at any time. This is the only safe way to do updates because you can't assume how many primaries and replicas each index has, or assume where each copy is.
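As a rough illustration of that pairing, a pod disruption budget that never allows more than one Elasticsearch pod to be voluntarily evicted looks roughly like the sketch below; the name, labels and apiVersion are placeholders, not the chart's exact manifest.

```yaml
# Sketch of a PodDisruptionBudget allowing at most 1 pod to be unavailable
# during voluntary disruptions (node drains, rolling updates of the nodes).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: elasticsearch-master-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: elasticsearch-master
```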
Could you go into more detail about what your setup looks like, and what problem you are having? If you made a change to the statefulset, only 1 pod should be updated at a time and all data should remain available during the entire restart. If more than 1 pod became unready, that sounds like a problem; if the other pods in the cluster were still ready, then it sounds like everything was functioning as expected.
I would really discourage you from doing this. The setting was made configurable so that the cluster-aware health check can be disabled for client-only nodes. Doing this on any other kind of (stateful) node is not recommended and will lead to downtime. If your cluster really can't survive losing a single pod, then it sounds like there is another issue; changing this setting is only masking the actual problem.
Thanks @Crazybus for taking the time to shed light on this and explain. As for my setup, it is trying to keep 30 days' worth of logs, and each index is roughly 150GB. What I noticed with the health check as is: if I delete 1 pod, after many hours it is still not ready. To me that is not normal. I will put back the health check param.
What does the output of the cluster health API look like while this is happening? There are a few reasons that this can be slow; fixing why it is slow is the real issue here, because you always want to wait for the cluster to become green again during restarts. Some ideas of things you should be looking at:
The indices recovery documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-recovery.html) can also give some useful information about what might be going wrong if things really aren't recovering at all.
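For anyone debugging a slow recovery, the standard APIs below are a reasonable starting point. This is only a sketch: it assumes the cluster is reachable on localhost:9200 (for example via kubectl port-forward) and that security is disabled or credentials are added to the curl calls.

```bash
# Overall cluster status and the number of initializing/unassigned shards
curl -s 'localhost:9200/_cluster/health?pretty'

# Explanation of why a shard is unassigned or slow to allocate
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

# Progress of ongoing shard recoveries (bytes and percentage recovered)
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'
```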
Hi @Crazybus, sorry for the delay in getting back to you. Once again, many thanks for such a thorough explanation. I've deleted a pod (out of the 3 pods, there are always 2 with high CPU consumption; I deleted one of those).
And it was ready quite quickly, just under 4 minutes.
I did the same with the pod using the least CPU of the 3, and it came back even more quickly, just over a minute and a half:
I'm not sure how to reproduce what I was experiencing when I opened the issue. It might have to do with me having allocated insufficient RAM at the time of opening the issue, or not. In any case.
It was interesting to see the unassigned shards get assigned as I ran this command continuously. As for indices:
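(The exact command used above isn't shown in the thread; something along these lines shows the same information, assuming the cluster is reachable on localhost:9200.)

```bash
# Re-run the health check every 2 seconds to watch unassigned shards drain
watch -n 2 "curl -s 'localhost:9200/_cluster/health?pretty'"

# List indices with their health, shard counts and on-disk size
curl -s 'localhost:9200/_cat/indices?v&h=health,index,pri,rep,store.size'
```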
Could you please tell me, at first glance, why you think the heap size may be off?
This issue can be closed as it works now!
It was just a hunch based on your heap size and amount of data. I would need to see some monitoring data from the cluster to have a better idea. However, it really depends on so many factors and how the cluster is being used and queried. If things are performing well, then there is nothing to worry about. Here is an old blog post that goes into some more detail about what can influence this: https://www.elastic.co/blog/a-heap-of-trouble
I was having the same issue. Earlier I had set a password shorter than 20 characters, but after setting a password 20 characters long, the pod status turned ready within 100s.
Thanks for your hint. My cluster now goes green. The default params don't work.
Hi,
I tried keeping more and more indices and eventually got a circuit breaker error, so I increased RAM and heap size.
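For illustration, heap and pod memory are typically raised together through chart values along these lines; the numbers below are placeholders, and the heap should stay well below the container memory limit.

```yaml
# Sketch of raising heap and container memory together (values are examples).
esJavaOpts: "-Xmx4g -Xms4g"   # JVM heap, roughly half of the container memory
resources:
  requests:
    memory: "8Gi"
  limits:
    memory: "8Gi"
```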
I don't know whether that change is related, but now the Elasticsearch pods do not pass the readiness probe. If I remove the readiness probe, things seem to work fine: all logs get collected and stored in Elasticsearch.
Any idea what might be going on?