HQ worker keeps running without jobs #784
Hi, sorry for the late reply. If a worker does not receive any task to compute within 5 minutes, it should turn itself off. However, there can be a situation where HQ repeatedly spawns allocations and workers even though it has waiting tasks. HQ currently cannot tell whether a given allocation will be able to run a given task (because of resource requirements). It has a small heuristic not to spawn allocations that would run for e.g. 10 minutes if tasks have a time request of 30 minutes, but it does not do much more. So the following could happen: HQ spawns allocations whose workers cannot actually run the waiting tasks, those workers sit idle until the idle timeout shuts them down, and the tasks keep waiting.
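To illustrate the kind of walltime heuristic described above (not HyperQueue's actual code; the names and the minutes-based model are hypothetical), a minimal sketch: do not spawn an allocation whose time limit is shorter than every waiting task's time request.

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Minimum runtime requested by the task, in minutes (hypothetical model).
    time_request_min: float

def allocation_is_useful(allocation_walltime_min: float, waiting_tasks: list[Task]) -> bool:
    """Return True if at least one waiting task could fit into the allocation.

    Mirrors the heuristic mentioned above: do not spawn an allocation that
    runs for e.g. 10 minutes if the waiting tasks request 30 minutes.
    """
    return any(t.time_request_min <= allocation_walltime_min for t in waiting_tasks)

# Example: a 10-minute allocation is useless for tasks requesting 30 minutes,
# so an autoallocator applying this check would not spawn it.
tasks = [Task(time_request_min=30)]
assert not allocation_is_useful(10, tasks)
assert allocation_is_useful(60, tasks)
```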
Thanks for the info. All my jobs have the same definition (a quarter of a node's CPUs plus a time request of a couple of hours, under the batch queue limit), so HQ should not have a problem executing them, but who knows what happened. Out of curiosity: is there anything in HQ's debug output that would indicate that HQ is unable to execute a task, and why?
Currently, no (@spirali, unless I'm mistaken). It is quite tricky to figure that out and log it by default, although we have been thinking about some explicit querying support, e.g. being able to ask "why is task X not executed by worker Y?"
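As a rough sketch of what such a query could report, here is a simplified model with only CPU count and remaining walltime; the names and structure are hypothetical and this is not an existing HQ feature or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskRequest:
    cpus: int                      # CPUs requested by the task
    time_request_min: float        # minimum runtime the task asks for, in minutes

@dataclass
class WorkerState:
    free_cpus: int                 # CPUs currently unoccupied on the worker
    remaining_walltime_min: float  # time left before the worker's allocation ends

def why_not_scheduled(task: TaskRequest, worker: WorkerState) -> Optional[str]:
    """Return a human-readable reason why the task cannot run on the worker,
    or None if it should be schedulable under this simplified model."""
    if task.cpus > worker.free_cpus:
        return f"task needs {task.cpus} CPUs, worker has only {worker.free_cpus} free"
    if task.time_request_min > worker.remaining_walltime_min:
        return (f"task requests {task.time_request_min} min, but the worker's "
                f"allocation ends in {worker.remaining_walltime_min} min")
    return None

# Example: a task asking for 32 CPUs and 120 minutes on a worker with
# 32 free CPUs but only 45 minutes of walltime left.
print(why_not_scheduled(TaskRequest(cpus=32, time_request_min=120),
                        WorkerState(free_cpus=32, remaining_walltime_min=45)))
```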
Hi,
I have observed a situation where HQ workers and batch jobs are running but all HQ jobs are waiting. I assume HQ closes workers when they are not running anything, so something must have blocked that. I logged into the worker node (WN) and listed the processes. I would welcome any insight into what could keep the worker running.
Listing of running processes: