Make sure application mode clusters aren't stuck in status updating #710

live-wire · 2023-07-27T09:51:07Z

Fixes #709

I'm not sure there is any benefit to setting the status to Restarting, so I removed it. It shouldn't be the updater's responsibility to change the desired state.
When a deploy failed, the status kept looping: Restarting > Deploying > Failed > Restarting. I believe the reconciler should look at a Failed job and decide to respawn a new one.
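
To illustrate the split of responsibilities described above, here is a minimal sketch, not the operator's actual code. The names `JobState`, `Action`, `decide`, and the `restartOnFailure` flag are all hypothetical, chosen only for this example. The idea is that the updater records the observed state as-is, and the reconciler alone maps a Failed job to a respawn decision.

```go
package main

import "fmt"

// JobState is a hypothetical stand-in for the job status values
// mentioned in this PR; it is not the operator's real type.
type JobState string

const (
	JobDeploying JobState = "Deploying"
	JobRunning   JobState = "Running"
	JobFailed    JobState = "Failed"
)

// Action is a hypothetical description of what the reconciler does next.
type Action string

const (
	ActionNone    Action = "None"
	ActionRespawn Action = "RespawnJob"
)

// decide models the reconciler: it reads the observed state and chooses
// an action. The observed state itself is never rewritten to something
// like "Restarting" to express intent.
// restartOnFailure is an assumed flag standing in for whatever restart
// policy the cluster is configured with.
func decide(observed JobState, restartOnFailure bool) Action {
	if observed == JobFailed && restartOnFailure {
		return ActionRespawn
	}
	return ActionNone
}

func main() {
	for _, s := range []JobState{JobDeploying, JobRunning, JobFailed} {
		fmt.Printf("%s -> %s\n", s, decide(s, true))
	}
}
```

Because the status only ever holds observed values in this sketch, a failed deploy stays Failed until the reconciler acts on it, instead of cycling back through Restarting.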

regadas

I’m back next week, but IIRC the restart state is actually useful in some stages. I sense that we might be missing some state that is causing the status to be stuck. I can help out with this next week.

regadas · 2023-08-02T02:31:37Z

@live-wire Do you have a specific job where this is happening, or is it happening for every job?

live-wire · 2023-08-02T12:58:13Z

@regadas It seems to happen on every job (with or without high-availability Flink properties). To reproduce it, just delete the job manager pod.

Make sure application mode clusters aren't stuck in status updating

ada1013

regadas reviewed Jul 27, 2023

Job status takes a brief moment to publish

bbafcc1

regadas approved these changes Aug 17, 2023

regadas merged commit bf6cc0e into master Aug 17, 2023
