
Increasing number of nodes with 'start task failed' #341

Closed
JelleVE opened this issue Mar 5, 2020 · 6 comments

Comments

@JelleVE

JelleVE commented Mar 5, 2020

Problem Description

I am encountering the following behavior with Azure Batch. I am using Shipyard to start a pool of 500 low-priority nodes to process a list of 400,000 tasks. The pool size is managed with autoscaling.

During the upscaling of the pool I typically start to encounter 'start task failed' errors. Usually this starts with just a few nodes, but the number seems to increase steadily over time. The stdout.txt file on these nodes contains the following error:

2020-03-05T08:23:43,784166638+00:00 - DEBUG - Pulling Docker Image: mcr.microsoft.com/azure-batch/shipyard:3.9.1-cargo (fallback: 0)
2020-03-05T08:23:58,876629647+00:00 - ERROR - Error response from daemon: Get https://mcr.microsoft.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-03-05T08:23:58,878254953+00:00 - ERROR - No fallback registry specified, terminating

I have tried rebooting the affected nodes (multiple times). For some nodes this seems to work; for others it doesn't. In the latter case, the stdout.txt file contains the following information:

2020-03-05T10:07:32,638562000+00:00 - ERROR - /mnt/batch/tasks/volatile/.batch_shipyard_cascade_failed file exists, assuming cascade failure during node prep

Now, I can fix this problem by SSHing into the affected nodes, manually deleting the /mnt/batch/tasks/volatile/.batch_shipyard_cascade_failed file (as well as the /mnt/batch/tasks/volatile/.batch_shipyard_node_prep_finished file, for good measure) and rebooting said nodes. However, I don't think this is the intended way of fixing this. If not, what should I do when I encounter this problem?
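For reference, roughly what that manual workaround looks like, assuming SSH access to the pool has been set up; the node address and SSH port below are placeholders:

# hypothetical example of the manual cleanup; <node-ip> and the port come from
# the node's remote login settings
ssh -p 50000 shipyarduser@<node-ip> sudo rm -f \
    /mnt/batch/tasks/volatile/.batch_shipyard_cascade_failed \
    /mnt/batch/tasks/volatile/.batch_shipyard_node_prep_finished
# then reboot the node (e.g. from the Azure portal or via the Batch API)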

Batch Shipyard Version

3.9.1

@alfpark
Collaborator

alfpark commented Mar 5, 2020

There are two issues here. The first is an ephemeral error: Microsoft's container registry was not available at that time. You can specify a fallback registry to improve robustness in these situations. See the shipyard misc mirror-images command; you'll also need to specify the fallback registry and its credentials in the appropriate config files. Barring that, a reboot should typically fix this.
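Roughly, the setup looks like this (a sketch only; the exact configuration property names for the fallback registry are in the Batch Shipyard docs, and --configdir is assumed to point at your yaml configuration files):

# mirror the Batch Shipyard system images (e.g. the cargo image) into your own
# container registry so it can serve as a fallback when mcr.microsoft.com is
# degraded or unavailable
shipyard misc mirror-images --configdir config

# the fallback registry and its login credentials then need to be declared in
# the global config and credentials files so the nodes know where to pull from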

For the second issue, you need to examine the cascade-*.log files in the startup/wd directory to see what the actual error was. If you provision directly from the command line, these files are usually downloaded automatically to your working directory.
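For example, a quick way to pull those up on a node (the SSH user, port and address are placeholders):

# list the cascade logs that the start task wrote under startup/wd
ssh -p 50000 shipyarduser@<node-ip> 'ls -l /mnt/batch/tasks/startup/wd/cascade-*.log'
# and inspect their tails for the actual error
ssh -p 50000 shipyarduser@<node-ip> 'tail -n 100 /mnt/batch/tasks/startup/wd/cascade-*.log'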

@JelleVE
Author

JelleVE commented Mar 6, 2020

To give you a timeline of what has happened:

  • Yesterday, we created a pool of 250 nodes. On some of these nodes, we encountered the 'start task failed' issue. For the ones that were still in an error state after a couple of reboots, we checked for cascade-*.log files (in the startup/wd folder on the node itself), but there were none. After applying the method described in the opening post, we managed to get the pool into a stable state (100% of the nodes running) and monitored it throughout the day.
  • Around 8 AM this morning the situation suddenly changed. All of the nodes crashed (with plenty of tasks left to complete), as can be seen in the graph below. Right now I don't have any indication as to why this happened. It is, however, something that has been happening consistently over the past couple of days. The tasks on these nodes fail with exit code 125.
    Whenever this happens, we remove the pool and job, calculate which tasks still need to be processed, and rerun from scratch. A non-consistent number of hours later, everything crashes and we start again.
    [graph of the pool's node count dropping to zero around 8 AM, not reproduced here]
  • When I rerun some of these tasks, new nodes are started and they seem to run just fine. These nodes, however, do have a cascade-*.log file:
2020-03-06 09:02:39,771.771Z INFO cascade.py::_setup_logger:154 22:MainThread logger initialized, log file: /mnt/batch/tasks/startup/wd/cascade-docker-20200306T090239.log
2020-03-06 09:02:39,771.771Z INFO cascade.py::main:785 22:MainThread max concurrent downloads: 10
2020-03-06 09:02:39,772.772Z INFO cascade.py::main:804 22:MainThread container mode: DOCKER
2020-03-06 09:02:39,827.827Z INFO cascade.py::distribute_global_resources:758 22:MainThread 1 global resources matching container mode "docker"
2020-03-06 09:02:39,868.868Z ERROR cascade.py::write:124 22:MainThread Client-Request-ID=3b4d3a80-5f89-11ea-ad8d-0242ac110002 Retry policy did not allow for a retry: Server-Timestamp=Fri, 06 Mar 2020 09:02:39 GMT, Server-Request-ID=e155b818-601e-0010-4c95-f3f125000000, HTTP status code=409, Exception=There is already a lease present. ErrorCode: LeaseAlreadyPresent<?xml version="1.0" encoding="utf-8"?><Error><Code>LeaseAlreadyPresent</Code><Message>There is already a lease present.RequestId:e155b818-601e-0010-4c95-f3f125000000Time:2020-03-06T09:02:39.8668152Z</Message></Error>.

2020-03-06 09:02:39,868.868Z ERROR cascade.py::flush:128 22:MainThread <__main__.StandardStreamLogger object at 0x7f090ca61cd0>
2020-03-06 09:02:39,878.878Z ERROR cascade.py::write:124 22:MainThread Client-Request-ID=3b532cc8-5f89-11ea-9e1b-0242ac110002 Retry policy did not allow for a retry: Server-Timestamp=Fri, 06 Mar 2020 09:02:39 GMT, Server-Request-ID=e155b81f-601e-0010-4f95-f3f125000000, HTTP status code=409, Exception=There is already a lease present. ErrorCode: LeaseAlreadyPresent<?xml version="1.0" encoding="utf-8"?><Error><Code>LeaseAlreadyPresent</Code><Message>There is already a lease present.RequestId:e155b81f-601e-0010-4f95-f3f125000000Time:2020-03-06T09:02:39.8768251Z</Message></Error>.

2020-03-06 09:02:39,878.878Z ERROR cascade.py::flush:128 22:MainThread <__main__.StandardStreamLogger object at 0x7f090ca61cd0>
2020-03-06 09:02:39,890.890Z ERROR cascade.py::write:124 22:MainThread Client-Request-ID=3b54a814-5f89-11ea-ac20-0242ac110002 Retry policy did not allow for a retry: Server-Timestamp=Fri, 06 Mar 2020 09:02:39 GMT, Server-Request-ID=e155b821-601e-0010-5195-f3f125000000, HTTP status code=409, Exception=There is already a lease present. ErrorCode: LeaseAlreadyPresent<?xml version="1.0" encoding="utf-8"?><Error><Code>LeaseAlreadyPresent</Code><Message>There is already a lease present.RequestId:e155b821-601e-0010-5195-f3f125000000Time:2020-03-06T09:02:39.8888359Z</Message></Error>.

2020-03-06 09:02:39,890.890Z ERROR cascade.py::flush:128 22:MainThread <__main__.StandardStreamLogger object at 0x7f090ca61cd0>
2020-03-06 09:02:39,899.899Z DEBUG cascade.py::_direct_download_resources_async:631 22:MainThread blob lease 0525116a-be89-47c9-8714-acf97c767efd acquired for resource docker:xyz.azurecr.io/img:tag
2020-03-06 09:02:39,900.900Z INFO cascade.py::_pull_and_save:526 22:Thread-1 pulling docker image xyz.azurecr.io/img:tag
2020-03-06 09:02:39,900.900Z DEBUG cascade.py::_pull:511 22:Thread-1 pulling command: docker pull xyz.azurecr.io/img:tag
2020-03-06 09:06:57,045.045Z DEBUG cascade.py::_pull_and_save:531 22:Thread-1 tag: Pulling from img
35c102085707: Pulling fs layer
251f5509d51d: Pulling fs layer
8e829fe70a46: Pulling fs layer
6001e1789921: Pulling fs layer
1ea7d78ee921: Pulling fs layer
139cf959c852: Pulling fs layer
3521870bd973: Pulling fs layer
53e3b61a6a61: Pulling fs layer
6b388b8cc968: Pulling fs layer
4a46c70d2fa1: Pulling fs layer
ba4e4fa798a7: Pulling fs layer
725c505619e6: Pulling fs layer
8ae8f79512fc: Pulling fs layer
c4d5726f7121: Pulling fs layer
89684f271feb: Pulling fs layer
261b91159aaa: Pulling fs layer
cf5a135463f1: Pulling fs layer
8fbaa78943c0: Pulling fs layer
fd8513d5b7ae: Pulling fs layer
84a4d0683a20: Pulling fs layer
38fe21e46d24: Pulling fs layer
6868c9466669: Pulling fs layer
6001e1789921: Waiting
1ea7d78ee921: Waiting
139cf959c852: Waiting
89684f271feb: Waiting
3521870bd973: Waiting
53e3b61a6a61: Waiting
261b91159aaa: Waiting
6b388b8cc968: Waiting
cf5a135463f1: Waiting
8fbaa78943c0: Waiting
84a4d0683a20: Waiting
6868c9466669: Waiting
38fe21e46d24: Waiting
4a46c70d2fa1: Waiting
8ae8f79512fc: Waiting
c4d5726f7121: Waiting
ba4e4fa798a7: Waiting
725c505619e6: Waiting
8e829fe70a46: Download complete
6001e1789921: Verifying Checksum
6001e1789921: Download complete
251f5509d51d: Verifying Checksum
251f5509d51d: Download complete
35c102085707: Verifying Checksum
35c102085707: Download complete
3521870bd973: Verifying Checksum
3521870bd973: Download complete
1ea7d78ee921: Verifying Checksum
1ea7d78ee921: Download complete
53e3b61a6a61: Download complete
6b388b8cc968: Verifying Checksum
6b388b8cc968: Download complete
4a46c70d2fa1: Verifying Checksum
4a46c70d2fa1: Download complete
35c102085707: Pull complete
ba4e4fa798a7: Verifying Checksum
ba4e4fa798a7: Download complete
251f5509d51d: Pull complete
8e829fe70a46: Pull complete
6001e1789921: Pull complete
1ea7d78ee921: Pull complete
725c505619e6: Verifying Checksum
725c505619e6: Download complete
c4d5726f7121: Verifying Checksum
c4d5726f7121: Download complete
139cf959c852: Verifying Checksum
139cf959c852: Download complete
261b91159aaa: Verifying Checksum
261b91159aaa: Download complete
89684f271feb: Verifying Checksum
89684f271feb: Download complete
8fbaa78943c0: Verifying Checksum
8fbaa78943c0: Download complete
8ae8f79512fc: Verifying Checksum
8ae8f79512fc: Download complete
fd8513d5b7ae: Download complete
38fe21e46d24: Verifying Checksum
38fe21e46d24: Download complete
6868c9466669: Verifying Checksum
6868c9466669: Download complete
84a4d0683a20: Verifying Checksum
84a4d0683a20: Download complete
cf5a135463f1: Verifying Checksum
cf5a135463f1: Download complete
139cf959c852: Pull complete
3521870bd973: Pull complete
53e3b61a6a61: Pull complete
6b388b8cc968: Pull complete
4a46c70d2fa1: Pull complete
ba4e4fa798a7: Pull complete
725c505619e6: Pull complete
8ae8f79512fc: Pull complete
c4d5726f7121: Pull complete
89684f271feb: Pull complete
261b91159aaa: Pull complete
cf5a135463f1: Pull complete
8fbaa78943c0: Pull complete
fd8513d5b7ae: Pull complete
84a4d0683a20: Pull complete
38fe21e46d24: Pull complete
6868c9466669: Pull complete
Digest: sha256:a1b53bbb43277a27ef2d6c73e3157405f56b12656d058cd2961126af471ee245
Status: Downloaded newer image for xyz.azurecr.io/img:tag

2020-03-06 09:06:57,045.045Z DEBUG cascade.py::_pull_and_save:553 22:Thread-1 took 257.145267 sec to pull docker image xyz.azurecr.io/img:tag
2020-03-06 09:06:57,045.045Z INFO cascade.py::_inc_resource_count:662 22:Thread-1 all 1 global resources of container mode "docker" loaded
2020-03-06 09:06:57,127.127Z DEBUG cascade.py::run:404 22:Thread-1 blob lease released for docker:xyz.azurecr.io/img:tag
2020-03-06 09:06:57,249.249Z WARNING cascade.py::download_monitor_async:705 22:MainThread singularity cache dir /mnt/singularity/cache does not exist
2020-03-06 09:06:57,249.249Z WARNING cascade.py::download_monitor_async:721 22:MainThread singularity sypgp dir /mnt/singularity/sypgp does not exist
2020-03-06 09:06:57,251.251Z ERROR cascade.py::flush:128 22:MainThread <__main__.StandardStreamLogger object at 0x7f090ca61cd0>
2020-03-06 09:06:57,251.251Z ERROR cascade.py::flush:128 22:MainThread <__main__.StandardStreamLogger object at 0x7f090ca61cd0>
2020-03-06 09:06:57,252.252Z ERROR cascade.py::flush:128 22:MainThread <__main__.StandardStreamLogger object at 0x7f090ca61cd0>

We are using Shipyard through the command-line interface. Am I wrong to assume that this means the log files won't be downloaded automatically?

@alfpark
Collaborator

alfpark commented Mar 6, 2020

  • Yesterday, we created a pool of 250 nodes. On some of these nodes, we encountered the 'start task failed' issue. For the ones that were still in an error state after a couple of reboots, we checked for cascade-*.log files (in the startup/wd folder on the node itself), but there were none. After applying the method described in the opening post, we managed to get the pool into a stable state (100% of the nodes running) and monitored it throughout the day.

If you encounter the 'start task failed' state and there are no cascade-*.log files, that means your nodes are failing before the Docker image pulls even begin. See my previous reply: it could be that the Microsoft container registry was degraded or unavailable, which is usually a rare event. However, you can create a fallback as described above; this requires additional setup and maintenance as you upgrade Shipyard versions.

  • Around 8 AM this morning the situation suddenly changed. All of the nodes crashed (with plenty of tasks left to complete), as can be seen in the graph below. Right now I don't have any indication as to why this happened. It is, however, something that has been happening consistently over the past couple of days. The tasks on these nodes fail with exit code 125.

This is your program exiting with that exit code, and I would be surprised if Shipyard has anything to do with this error. You will need to investigate why the task/process/container is exiting with that code. Also ensure that your task/process/container is not OOMing or running the machine out of disk space.
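For reference, exit status 125 is also the code the docker CLI itself returns when docker run fails before the container's command ever executes (for example, on a daemon error). A quick way to see that locally, assuming docker and any small image such as busybox are available:

# per the docker run reference, status 125 means the docker daemon/CLI itself
# reported an error (as opposed to 126/127 for the contained command); an
# unknown flag is an easy way to reproduce it
docker run --foo busybox
echo $?   # prints 125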

alfpark added a commit that referenced this issue Mar 6, 2020
@JelleVE
Author

JelleVE commented Mar 11, 2020

Thank you for the info.
Over the past few days, we have been rerunning the remainder of the tasks. The 'start task failed' issue has been mostly resolved. However, the problem with the simultaneously failing tasks/crashing nodes persists. Each night, after roughly 17 hours of processing, the batch process starts to crash. The machines are not running out of disk space, and all tasks OOMing at the same time should be an extremely rare event (the tasks are relatively heterogeneous). I should be able to monitor the process more closely tomorrow and will report back (either here or on the Batch repo, if more appropriate).

@JelleVE
Author

JelleVE commented Mar 17, 2020

I managed to watch it crash. At that point, the pool of 500 nodes had been running for roughly 14 hours, with roughly 15,000 tasks finished. All running tasks were still being handled as intended. However, once a node finished its current task and tried to pick up a new one, the following error occurred (stderr.txt):

docker: Error response from daemon: error while creating mount source path '/mnt/batch/tasks/mounts/azblob-glblob-output': mkdir /mnt/batch/tasks/mounts/azblob-glblob-output: file exists.

This is all the output present on the node. From that point on, the node speeds through all pending tasks until every one of them is in a failed state.

@alfpark
Collaborator

alfpark commented Mar 17, 2020

You will need to examine the blobfuse logs to understand this behavior. Additional mount options may be needed. See https://github.com/Azure/azure-storage-fuse for more information about blobfuse.
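A rough starting point on an affected node (a sketch only; the syslog location and the stale-mount explanation in the comments are assumptions to verify against your own setup):

# check whether the blob container is still actually mounted; a dead FUSE mount
# left behind at this path is one plausible cause of docker's
# "mount source path ... file exists" error
mountpoint /mnt/batch/tasks/mounts/azblob-glblob-output
ls -la /mnt/batch/tasks/mounts/

# blobfuse typically logs through syslog; the exact file depends on the
# distro's rsyslog configuration
sudo grep -i blobfuse /var/log/syslog | tail -n 50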

alfpark closed this as completed Jun 5, 2020
alfpark added a commit that referenced this issue Mar 20, 2023