Silent delocalizing failure #165
While I am using preemptible VMs in this run, I've tried non-preemptible VMs to the same effect. Also, I set retries to 6, and I expect each task to finish in around 20 minutes.
Hi @sjgosai, I'm curious to see if there are any other failures that didn't make it to the log files. Could you please grab the internal-id of a failed task (projects/sabeti-encode/operations/8834123416523977731 in this case) and pass it to:
The events/errors here may make it clearer what is going on. Thanks,
Thanks for helping me troubleshoot!
Hi @sjgosai, the error with code 10 and 'The assigned worker has failed to complete the operation' usually indicates a preemption, or possibly some other kind of transient error. Do you see the same error with non-preemptible machines? My recommendation would be to retry the job, or to run it on non-preemptible machines.
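As a rough illustration of the retry advice above, here is a minimal Python sketch that treats error code 10 as transient and reruns a task a bounded number of times. The `run_task` callable and the shape of the error dict are hypothetical and not part of dsub or the Pipelines API; only the code value 10 comes from this thread. dsub has its own retry support, so this only illustrates the logic.

```python
from typing import Callable, Optional, Tuple

# Code 10 is the one discussed in this thread; add others only once confirmed.
TRANSIENT_CODES = {10}


def is_transient(error: Optional[dict]) -> bool:
    """Return True if the (hypothetical) error dict carries a transient code."""
    return error is not None and error.get("code") in TRANSIENT_CODES


def run_with_retries(
    run_task: Callable[[], Optional[dict]], max_retries: int = 6
) -> Tuple[bool, int]:
    """Run a task, retrying only on transient errors.

    run_task is a hypothetical callable that returns None on success
    or an error dict such as {"code": 10} on failure.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        error = run_task()
        if error is None:
            return True, attempts
        if not is_transient(error):
            # A non-transient failure will not be fixed by retrying.
            return False, attempts
    return False, attempts
```

With `max_retries=6` the task is attempted at most seven times: the first run plus six retries.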
Thanks for poking around, @wnojopra. I'm pretty sure I've done some tests with non-preemptibles, but let me do it once more and save the logs.
The problem with Pipelines API error code 10 has been impacting workflows run by Cromwell as well. The Terra team has published an article about this, which is worth a quick read. For dsub, see https://github.com/DataBiosphere/dsub/blob/master/docs/retries.md
It looks like I need to use |
I would expect this specific error code 10 to only happen on preemptible machines. If you're using a non-preemptible machine, I would expect your error rate to be extremely low. If you're seeing a high rate of transient errors with non-preemptible machines, do please share your logs with us. Also keep in mind that the issue isn't specific to delocalization. Preemptible machines can be preempted at any point of the run (though most happen near the start). |
Here is some info on my last test. I used non-preemptible VMs but I did use
And then on the re-try:
The |
Here are the
Second:
Finally, here's the
Hi @sjgosai! For this last example, there is no indication that delocalization had actually started. It looks like the "Worker" on the node failed to check in prior to delocalization. Even though your user-command did finish, it is very possible that you were right on the edge of being out of memory and the OOM killer may have killed the Pipelines API Worker. Our first recommendation here is to increase the amount of memory available on the VM. See if that makes a difference in the success-without-retry rate.
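One way to check how close a command runs to the memory limit before resizing the VM is to measure its peak resident memory locally. The sketch below is only an illustration under assumptions: the 20% reserve is an arbitrary margin, not a dsub or Pipelines API figure, and the function names are made up for this example.

```python
import resource
import subprocess
import sys


def peak_child_rss_gb(cmd: list) -> float:
    """Run cmd to completion and return the peak resident set size in GB.

    RUSAGE_CHILDREN reports the largest waited-for child process, not the
    sum over a process tree. On Linux ru_maxrss is in kilobytes; macOS
    reports bytes, so the conversion below assumes Linux.
    """
    subprocess.run(cmd, check=True)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak_kb / (1024 ** 2)


def has_headroom(peak_gb: float, vm_ram_gb: float, reserve_fraction: float = 0.2) -> bool:
    """Return True if peak usage leaves at least reserve_fraction of VM RAM free."""
    return peak_gb <= vm_ram_gb * (1.0 - reserve_fraction)


if __name__ == "__main__":
    # Placeholder command; substitute the real analysis script and VM size.
    peak = peak_child_rss_gb([sys.executable, "-c", "pass"])
    print(f"peak RSS: {peak:.3f} GB, headroom on a 4 GB VM: {has_headroom(peak, 4.0)}")
```

If `has_headroom` returns False for the VM size in use, other processes on the node have little memory left, which fits the OOM explanation above.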
Thanks @mbookman it seems like that's working. I'm just a bit surprised because I had originally tested my code with my test data on a Thanks for your help! Edit: |
Hello! I'm trying to use dsub with the --tasks option to run an analysis in 20 chunks. Curiously, the *.logs indicate that the script runs to completion for every task, but only some random subset execute the delocalizing. Furthermore, the tasks that don't delocalize don't throw any kind of error captured in the *.logs. dstat -f, however, identifies the tasks that failed.
Here's an example of a success:
And a failure:
dsub version: 0.3.2
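Since the logs alone do not reveal which tasks failed, it can help to filter the job status output programmatically. The sketch below assumes the status was saved as JSON text containing a list of task records with "task-id" and "status" keys, and that failed tasks carry the status FAILURE. These key names and the record shape are assumptions to verify against the output of your dsub version.

```python
import json


def failed_task_ids(dstat_json: str) -> list:
    """Return the task ids of records whose status is FAILURE.

    dstat_json is JSON text holding a list of task records. The keys
    "task-id" and "status" are assumed, not confirmed for every dsub version.
    """
    records = json.loads(dstat_json)
    return [r.get("task-id") for r in records if r.get("status") == "FAILURE"]


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = json.dumps([
        {"task-id": "1", "status": "SUCCESS"},
        {"task-id": "2", "status": "FAILURE"},
    ])
    print(failed_task_ids(sample))
```

The resulting ids can then be matched against the per-task log files to see which tasks finished the user script but never delocalized.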