send a target to a second worker in clustermq parallelism #1287

kendonB · 2020-06-25T03:37:05Z

Prework

Read and abide by drake's code of conduct.
Search for duplicates among the existing issues, both open and closed.

Proposal

I found a case where a dynamic target came very close to finishing but never completed, even though I still had workers up and waiting for work. I suspect that targets were allocated to workers that then disappeared when they hit the HPC time limit. What I would have liked is for drake to recognise that a worker has disappeared and then send its target to another worker that is still running.

I believe this would require clustermq to be able to report which workers have disappeared (via SLURM, in my case).
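
To make the request concrete, here is a minimal, language-agnostic sketch of the behaviour, written in Python rather than R. It is not drake or clustermq code. It assumes the scheduler can see a "last heartbeat" timestamp for each worker, which is exactly the information this proposal says is missing today. The function name, the dictionaries, and the timeout value are all hypothetical.

```python
from collections import deque


def reassign_lost_targets(assignments, last_seen, now, timeout, queue):
    """Move targets held by silent workers back onto the pending queue.

    assignments: dict mapping worker id -> target currently running there
    last_seen:   dict mapping worker id -> time of that worker's last heartbeat
    Returns the list of worker ids that were treated as lost.
    """
    # A worker counts as lost if it has never reported or has been silent too long.
    lost = [
        worker
        for worker in assignments
        if now - last_seen.get(worker, float("-inf")) > timeout
    ]
    for worker in lost:
        # Put the orphaned target at the front so a live worker picks it up next.
        queue.appendleft(assignments.pop(worker))
    return lost


# Hypothetical state: worker_2 hit its time limit and stopped reporting.
queue = deque(["target_c"])
assignments = {"worker_1": "target_a", "worker_2": "target_b"}
last_seen = {"worker_1": 100.0, "worker_2": 40.0}

lost = reassign_lost_targets(assignments, last_seen, now=105.0, timeout=30.0, queue=queue)
```

In this sketch `target_b` goes back to the front of the queue, so one of the idle workers would pick it up before any new work, instead of the pipeline stalling.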

wlandau · 2020-06-25T11:03:56Z

Unfortunately, drake has no way of knowing which clustermq workers stopped unexpectedly or which target was running at the time. Maybe follow up on mschubert/clustermq#101.

kendonB added the type: new feature label Jun 25, 2020

kendonB assigned wlandau Jun 25, 2020

wlandau closed this as completed Jun 25, 2020
