Increasing number of nodes with 'start task failed' #341
Comments
There are two issues here. The first is an ephemeral error: Microsoft's container registry was not available during that time. You can specify a fallback registry to help improve robustness for these situations; see the fallback registry documentation. For the second issue, you need to examine the cascade log on the affected node to determine why node preparation failed.
If you encounter a 'start task failed' state and there are no …
This is your program exiting with that exit code, and I would be surprised if Shipyard has anything to do with this error. You will need to investigate why the task/process/container is exiting with that exit code. Also ensure that your task/process/container is not OOM'ing or running the machine out of disk space.
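As a rough illustration of the kinds of checks meant here (not Shipyard-specific; the container ID and paths below are placeholders), something like the following run on an affected node can help rule out OOM kills and disk exhaustion:

```bash
#!/usr/bin/env bash
# Run on a suspect compute node (e.g. via SSH). All names below are examples.

# 1. Look for kernel OOM-killer activity around the time the task exited.
dmesg -T | grep -iE 'out of memory|oom-killer' | tail -n 20

# 2. Check for disk exhaustion on the OS and task/ephemeral disks.
df -h / /mnt

# 3. If the workload runs in a Docker container, inspect its exit state.
#    '<container-id>' is a placeholder for the failed task's container.
docker ps -a --format 'table {{.ID}}\t{{.Image}}\t{{.Status}}'
docker inspect <container-id> \
  --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'
docker logs --tail 100 <container-id>
```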
Thank you for the info.
I managed to watch it crash. At that point, the pool of 500 nodes had been running for roughly 14 hours with roughly 15,000 tasks finished. All running tasks were still being handled as intended. However, once a node finished running its current task and tried to pick up a new one, the following error occurred:
This is all the output that is on the node. With this output, the node speeds through all pending tasks until all of them are in a failed state.
You will need to examine the blobfuse logs to understand this behavior. Additional mount options may potentially be needed. Please see https://github.com/Azure/azure-storage-fuse for more information about blobfuse.
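For reference, a minimal sketch of pulling the blobfuse (v1) logs and remounting by hand with more verbose logging. The mount point, temp path, and config file paths are assumptions; the actual values used by Shipyard's data volume mounts will differ, so treat this only as an illustration of where to look:

```bash
#!/usr/bin/env bash
# blobfuse v1 logs to syslog by default; pull recent entries on the node.
grep -i blobfuse /var/log/syslog | tail -n 50

# Example of remounting manually with debug logging and relaxed kernel
# cache timeouts. All paths and values here are placeholders.
blobfuse /mnt/mycontainer \
    --tmp-path=/mnt/blobfusetmp \
    --config-file=/path/to/connection.cfg \
    --log-level=LOG_DEBUG \
    -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
```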
Problem Description
I am encountering the following behavior with Azure Batch. I am using Shipyard to start a pool of 500 low-priority nodes to perform a list of 400,000 tasks. The pool size is managed using auto-scaling.
During the upscaling of the pool I typically start to encounter 'start task failed' errors. Usually this starts with just a few nodes, but the number seems to increase steadily over time. The stdout.txt file on these nodes contains the following error:

I have tried rebooting the affected nodes (multiple times). For some nodes this seems to work; for others it doesn't. In the latter case, the stdout.txt file contains the following information:

Now, I can fix this problem by ssh'ing into the affected nodes, manually deleting the /mnt/batch/tasks/volatile/.batch_shipyard_cascade_failed file (as well as the /mnt/batch/tasks/volatile/.batch_shipyard_node_prep_finished file, just for good measure) and rebooting said nodes. However, I don't think this is the intended way of fixing this. If not, what should I do when I encounter this problem?
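For concreteness, here is a rough sketch of that manual workaround as shell commands run per affected node. The SSH user, key, IP, and port are placeholders (Batch compute nodes expose SSH via node-specific remote login settings), so this only illustrates the steps described above rather than a recommended procedure:

```bash
#!/usr/bin/env bash
# Placeholders: NODE_IP and NODE_PORT come from the pool's remote login
# settings; the SSH user/key are whatever was configured for the pool.
NODE_IP="<node-public-ip>"
NODE_PORT="<node-ssh-port>"

# Remove the Shipyard sentinel files and reboot the node.
ssh -i ~/.ssh/id_rsa_shipyard -p "${NODE_PORT}" shipyard@"${NODE_IP}" \
    'sudo rm -f /mnt/batch/tasks/volatile/.batch_shipyard_cascade_failed \
             /mnt/batch/tasks/volatile/.batch_shipyard_node_prep_finished \
     && sudo reboot'
```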
Batch Shipyard Version
3.9.1