
Silent delocalizing failure #165

Closed

sjgosai opened this issue Jul 25, 2019 · 13 comments
@sjgosai

sjgosai commented Jul 25, 2019

Hello! I'm trying to use dsub with the --tasks option to run an analysis in 20 chunks. Curiously, the *.log files indicate that the script runs to completion for every task, but only a random subset of the tasks execute the delocalization step. Furthermore, the tasks that don't delocalize don't write any kind of error to the *.log files. dstat -f, however, does identify the tasks that failed.
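
For reference, my submission looks roughly like this (paths shortened and flags from memory, so treat it as a sketch):

dsub \
  --provider google-v2 \
  --project sabeti-encode \
  --logging gs://haddath/sgosai/hff/logs/ \
  --image sjgosai/hff-kit:0.1.2 \
  --machine-type n1-standard-8 \
  --disk-size 200 \
  --preemptible \
  --script ./python \
  --tasks tasks.tsv

where tasks.tsv has a header row with --env CHUNK, --input INFILE, and --output OUTFILE columns, and one row per chunk.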

Here's an example of a success:

- create-time: '2019-07-25 02:16:39.297447'
  dsub-version: v0-3-2
  end-time: '2019-07-25 02:32:30.556849'
  envs:
    CHUNK: '3'
  events:
  - name: start
    start-time: 2019-07-25 06:16:42.171100+00:00
  - name: pulling-image
    start-time: 2019-07-25 06:17:32.995391+00:00
  - name: localizing-files
    start-time: 2019-07-25 06:18:34.308943+00:00
  - name: running-docker
    start-time: 2019-07-25 06:18:36.658863+00:00
  - name: delocalizing-files
    start-time: 2019-07-25 06:32:24.497567+00:00
  - name: ok
    start-time: 2019-07-25 06:32:30.556849+00:00
  input-recursives: {}
  inputs:
    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
  internal-id: projects/sabeti-encode/operations/1351805964445161078
  job-id: python--sagergosai--190725-021637-18
  job-name: python
  labels: {}
  last-update: '2019-07-25 02:32:30.556849'
  logging: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.4.1.log
  mounts: {}
  output-recursives: {}
  outputs:
    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__3_20.bed
  provider: google-v2
  provider-attributes:
    accelerators: []
    boot-disk-size: 250
    cpu_platform: ''
    disk-size: 200
    disk-type: pd-standard
    enable-stackdriver-monitoring: false
    instance-name: google-pipelines-worker-fae4230d454b3f6e1038535cbcb0da50
    machine-type: n1-standard-8
    network: ''
    preemptible: true
    regions: []
    service-account: default
    subnetwork: ''
    use_private_address: false
    zone: us-west2-c
    zones:
    - us-central1-a
    - us-central1-b
    - us-central1-c
    - us-central1-f
    - us-east1-b
    - us-east1-c
    - us-east1-d
    - us-east4-a
    - us-east4-b
    - us-east4-c
    - us-west1-a
    - us-west1-b
    - us-west1-c
    - us-west2-a
    - us-west2-b
    - us-west2-c
  script: |-
    #!/usr/bin/env bash
    python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100
  script-name: python
  start-time: '2019-07-25 02:16:42.171100'
  status: SUCCESS
  status-detail: Success
  status-message: Success
  task-attempt: 1
  task-id: '4'
  user-id: sagergosai

And a failure:

- create-time: '2019-07-25 02:16:39.576571'
  dsub-version: v0-3-2
  end-time: '2019-07-25 02:52:45.047989'
  envs:
    CHUNK: '4'
  events:
  - name: start
    start-time: 2019-07-25 06:16:42.182994+00:00
  - name: pulling-image
    start-time: 2019-07-25 06:17:41.422799+00:00
  - name: localizing-files
    start-time: 2019-07-25 06:18:41.913631+00:00
  - name: running-docker
    start-time: 2019-07-25 06:18:44.379215+00:00
  - name: The assigned worker has failed to complete the operation
    start-time: 2019-07-25 06:52:43.907976+00:00
  input-recursives: {}
  inputs:
    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
  internal-id: projects/sabeti-encode/operations/8834123416523977731
  job-id: python--sagergosai--190725-021637-18
  job-name: python
  labels: {}
  last-update: '2019-07-25 02:52:45.047989'
  logging: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1.log
  mounts: {}
  output-recursives: {}
  outputs:
    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__4_20.bed
  provider: google-v2
  provider-attributes:
    accelerators: []
    boot-disk-size: 250
    cpu_platform: ''
    disk-size: 200
    disk-type: pd-standard
    enable-stackdriver-monitoring: false
    instance-name: google-pipelines-worker-1d27f8b0a26375721946e521a550105a
    machine-type: n1-standard-8
    network: ''
    preemptible: true
    regions: []
    service-account: default
    subnetwork: ''
    use_private_address: false
    zone: us-east1-b
    zones:
    - us-central1-a
    - us-central1-b
    - us-central1-c
    - us-central1-f
    - us-east1-b
    - us-east1-c
    - us-east1-d
    - us-east4-a
    - us-east4-b
    - us-east4-c
    - us-west1-a
    - us-west1-b
    - us-west1-c
    - us-west2-a
    - us-west2-b
    - us-west2-c
  script: |-
    #!/usr/bin/env bash
    python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100
  script-name: python
  start-time: '2019-07-25 02:16:42.182994'
  status: FAILURE
  status-detail: The assigned worker has failed to complete the operation
  status-message: The assigned worker has failed to complete the operation
  task-attempt: 1
  task-id: '5'
  user-id: sagergosai

dsub version: 0.3.2

@sjgosai
Author

sjgosai commented Jul 25, 2019

While I am using preemptible VMs in this run, I've tried non-preemptible VMs to the same effect. Also, I set retries to 6, and I expect each task to finish in around 20 minutes.

@wnojopra
Contributor

Hi @sjgosai,

I'm curious to see if there are any other failures that didn't make it to the log files.

Could you please grab the internal-id of a failed task (projects/sabeti-encode/operations/8834123416523977731 in this case) and pass it to:

gcloud alpha genomics operations describe <id>

The events/errors there may make it clearer what is going on.
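
If the full dump is too noisy, a projection like the following (assuming your gcloud supports --format field filtering) trims the output to just the error and the events:

gcloud alpha genomics operations describe <id> \
    --format='yaml(error, metadata.events)'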

Thanks,
Willy

@sjgosai
Author

sjgosai commented Jul 29, 2019

Thanks for helping me troubleshoot!

gcloud alpha genomics operations describe projects/sabeti-encode/operations/8834123416523977731
done: true
error:
  code: 10
  message: The assigned worker has failed to complete the operation
metadata:
  '@type': type.googleapis.com/google.genomics.v2alpha1.Metadata
  createTime: '2019-07-25T06:16:39.576571Z'
  endTime: '2019-07-25T06:52:45.047989Z'
  events:
  - description: Worker released
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.WorkerReleasedEvent
      instance: google-pipelines-worker-1d27f8b0a26375721946e521a550105a
      zone: us-east1-b
    timestamp: '2019-07-25T06:52:45.047989Z'
  - description: The assigned worker has failed to complete the operation
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.FailedEvent
      cause: The assigned worker has failed to complete the operation
      code: ABORTED
    timestamp: '2019-07-25T06:52:43.907976Z'
  - description: Started running "user-command"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 4
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-25T06:18:44.379215Z'
  - description: Stopped running "localization"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStoppedEvent
      actionId: 3
      exitStatus: 0
      stderr: ''
    timestamp: '2019-07-25T06:18:44.215951Z'
  - description: Started running "localization"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 3
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-25T06:18:41.913631Z'
  - description: Stopped running "prepare"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStoppedEvent
      actionId: 2
      exitStatus: 0
      stderr: ''
    timestamp: '2019-07-25T06:18:41.798351Z'
  - description: Started running "prepare"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 2
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-25T06:18:39.704159Z'
  - description: Started running "logging"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 1
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-25T06:18:39.588607Z'
  - description: Stopped pulling "sjgosai/hff-kit:0.1.2"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStoppedEvent
      imageUri: sjgosai/hff-kit:0.1.2
    timestamp: '2019-07-25T06:18:28.799887Z'
  - description: Started pulling "sjgosai/hff-kit:0.1.2"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent
      imageUri: sjgosai/hff-kit:0.1.2
    timestamp: '2019-07-25T06:17:41.422799Z'
  - description: Stopped pulling "python:2.7-slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStoppedEvent
      imageUri: python:2.7-slim
    timestamp: '2019-07-25T06:17:41.298671Z'
  - description: Started pulling "python:2.7-slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent
      imageUri: python:2.7-slim
    timestamp: '2019-07-25T06:17:37.001343Z'
  - description: Stopped pulling "google/cloud-sdk:slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStoppedEvent
      imageUri: google/cloud-sdk:slim
    timestamp: '2019-07-25T06:17:36.886335Z'
  - description: Started pulling "google/cloud-sdk:slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent
      imageUri: google/cloud-sdk:slim
    timestamp: '2019-07-25T06:17:26.121407Z'
  - description: Worker "google-pipelines-worker-1d27f8b0a26375721946e521a550105a"
      assigned in "us-east1-b"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.WorkerAssignedEvent
      instance: google-pipelines-worker-1d27f8b0a26375721946e521a550105a
      machineType: n1-standard-8
      zone: us-east1-b
    timestamp: '2019-07-25T06:16:42.182994Z'
  labels:
    dsub-version: v0-3-2
    job-id: python--sagergosai--190725-021637-18
    job-name: python
    task-attempt: '1'
    task-id: '5'
    user-id: sagergosai
  pipeline:
    actions:
    - commands:
      - -c
      - |
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        while [[ ! -d /google/logs/action/6 ]]; do
          mkdir -p /tmp/continuous_logging_action
        log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
        STDOUT_PID=$!
        log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
        STDERR_PID=$!

        log_cp /google/logs/output "${LOGGING_PATH}" /tmp/continuous_logging_action "false" "${USER_PROJECT}" &
        LOG_PID=$!

        wait "${STDOUT_PID}"
        wait "${STDERR_PID}"
        wait "${LOG_PID}"


          sleep 60s
        done
      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1.log
        STDERR_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1-stderr.log
        STDOUT_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1-stdout.log
        USER_PROJECT: ''
      flags:
      - RUN_IN_BACKGROUND
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts: []
      name: logging
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        #!/bin/bash

        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        mkdir -m 777 -p "/mnt/data/script" 
        mkdir -m 777 -p "/mnt/data/tmp" 
        mkdir -m 777 -p "/mnt/data/workingdir" 

        echo "${_SCRIPT_REPR}"     | python -c 'import ast
        import sys

        sys.stdout.write(ast.literal_eval(sys.stdin.read()))
        '     > /mnt/data/script/python
        chmod a+x /mnt/data/script/python

        for ((i=0; i < DIR_COUNT; i++)); do
          DIR_VAR="DIR_${i}"

          log_info "mkdir -m 777 -p "${!DIR_VAR}""
          mkdir -m 777 -p "${!DIR_VAR}"
        done

      entrypoint: /bin/bash
      environment:
        DIR_0: /mnt/data/input/gs/haddath/sgosai/hff/data
        DIR_1: /mnt/data/output/gs/haddath/sgosai/hff/data
        DIR_COUNT: '2'
        _META_YAML_REPR: "\"create-time: 2019-07-25 02:16:37.189610-04:00\\ndsub-version:\
          \ v0-3-2\\njob-id: python--sagergosai--190725-021637-18\\njob-name: python\\\
          nlogging: gs://haddath/sgosai/hff/logs/\\nscript-name: python\\ntask-ids:\
          \ 1-20\\ntasks:\\n- envs:\\n    CHUNK: '4'\\n  inputs:\\n    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt\\\
          n  logging-path: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1.log\\\
          n  outputs:\\n    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__4_20.bed\\\
          n  task-attempt: 1\\n  task-id: '5'\\nuser-id: sagergosai\\n\""
        _SCRIPT_REPR: "'#!/usr/bin/env bash\\npython /app/hcr-ff/call_peaks.py ${INFILE}\
          \ ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100'"
      flags: []
      imageUri: python:2.7-slim
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: prepare
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < INPUT_COUNT; i++)); do
          INPUT_VAR="INPUT_${i}"
          INPUT_RECURSIVE="INPUT_RECURSIVE_${i}"
          INPUT_SRC="INPUT_SRC_${i}"
          INPUT_DST="INPUT_DST_${i}"

          log_info "Localizing ${!INPUT_VAR}"
          if [[ "${!INPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!INPUT_SRC}" "${!INPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!INPUT_SRC}" "${!INPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      entrypoint: /bin/bash
      environment:
        INPUT_0: INFILE
        INPUT_COUNT: '1'
        INPUT_DST_0: /mnt/data/input/gs/haddath/sgosai/hff/data/FADS1_rep8detailed.txt
        INPUT_RECURSIVE_0: '0'
        INPUT_SRC_0: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
        USER_PROJECT: ''
      flags: []
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: localization
      pidNamespace: ''
      portMappings: {}
    - commands:
      - bash
      - -c
      - |
        export TMPDIR="/mnt/data/tmp"
        cd /mnt/data/workingdir

        /mnt/data/script/python
      entrypoint: /usr/bin/env
      environment:
        CHUNK: '4'
        INFILE: /mnt/data/input/gs/haddath/sgosai/hff/data/FADS1_rep8detailed.txt
        OUTFILE: /mnt/data/output/gs/haddath/sgosai/hff/data/FADS1_rep8__4_20.bed
      flags: []
      imageUri: sjgosai/hff-kit:0.1.2
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: user-command
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < OUTPUT_COUNT; i++)); do
          OUTPUT_VAR="OUTPUT_${i}"
          OUTPUT_RECURSIVE="OUTPUT_RECURSIVE_${i}"
          OUTPUT_SRC="OUTPUT_SRC_${i}"
          OUTPUT_DST="OUTPUT_DST_${i}"

          log_info "Delocalizing ${!OUTPUT_VAR}"
          if [[ "${!OUTPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      entrypoint: /bin/bash
      environment:
        OUTPUT_0: OUTFILE
        OUTPUT_COUNT: '1'
        OUTPUT_DST_0: gs://haddath/sgosai/hff/data/FADS1_rep8__4_20.bed
        OUTPUT_RECURSIVE_0: '0'
        OUTPUT_SRC_0: /mnt/data/output/gs/haddath/sgosai/hff/data/FADS1_rep8__4_20.bed
        USER_PROJECT: ''
      flags: []
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: delocalization
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }

        mkdir -p /tmp/logging_action
        log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDOUT_PID=$!
        log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDERR_PID=$!

        log_cp /google/logs/output "${LOGGING_PATH}" /tmp/logging_action "false" "${USER_PROJECT}" &
        LOG_PID=$!

        wait "${STDOUT_PID}"
        wait "${STDERR_PID}"
        wait "${LOG_PID}"

      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1.log
        STDERR_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1-stderr.log
        STDOUT_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190725-021637-18.5.1-stdout.log
        USER_PROJECT: ''
      flags:
      - ALWAYS_RUN
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts: []
      name: final_logging
      pidNamespace: ''
      portMappings: {}
    environment: {}
    resources:
      projectId: sabeti-encode
      regions: []
      virtualMachine:
        accelerators: []
        bootDiskSizeGb: 250
        bootImage: projects/cos-cloud/global/images/family/cos-stable
        cpuPlatform: ''
        disks:
        - name: datadisk
          sizeGb: 200
          sourceImage: ''
          type: pd-standard
        enableStackdriverMonitoring: false
        labels:
          dsub-version: v0-3-2
          goog-pipelines-worker: 'true'
          job-id: python--sagergosai--190725-021637-18
          job-name: python
          task-attempt: '1'
          task-id: '5'
          user-id: sagergosai
        machineType: n1-standard-8
        network:
          name: ''
          subnetwork: ''
          usePrivateAddress: false
        nvidiaDriverVersion: ''
        preemptible: true
        serviceAccount:
          email: default
          scopes:
          - https://www.googleapis.com/auth/bigquery
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.full_control
          - https://www.googleapis.com/auth/genomics
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring.write
      zones:
      - us-central1-a
      - us-central1-b
      - us-central1-c
      - us-central1-f
      - us-east1-b
      - us-east1-c
      - us-east1-d
      - us-east4-a
      - us-east4-b
      - us-east4-c
      - us-west1-a
      - us-west1-b
      - us-west1-c
      - us-west2-a
      - us-west2-b
      - us-west2-c
    timeout: 14400s
  startTime: '2019-07-25T06:16:42.182994Z'
name: projects/sabeti-encode/operations/8834123416523977731

@wnojopra
Contributor

Hi @sjgosai, the error with code 10 and 'The assigned worker has failed to complete the operation' usually indicates a preemption, or possibly some other kind of transient error. Do you see the same error with non-preemptible machines? My recommendation would be to retry the job, or to run it on non-preemptible machines.
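
For reference, you can also pull the per-task records (including the event timelines) straight from dstat rather than the raw operation dump; something like:

dstat \
    --provider google-v2 \
    --project sabeti-encode \
    --jobs 'python--sagergosai--190725-021637-18' \
    --status '*' \
    --full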

@sjgosai
Author

sjgosai commented Jul 29, 2019

Thanks for poking around, @wnojopra. I'm pretty sure I've done some tests with non-preemptible VMs, but let me do it once more and save the logs.

@mbookman
Contributor

The problem with Pipelines API error code 10 has been impacting workflows run by Cromwell as well. The Terra team has published an article about this, which is worth a quick read:

https://support.terra.bio/hc/en-us/community/posts/360046714292-Are-you-experiencing-PAPI-error-code-10-Read-this-for-a-workaround-

For dsub, you should be able to use the --retries flag to mitigate the problem with error code 10. See:

https://github.com/DataBiosphere/dsub/blob/master/docs/retries.md
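
As a sketch, retries are requested at submission time; something like the following (other flags elided) gives each task up to three additional attempts. Note that the launching dsub process is what resubmits failed tasks, so --retries is used together with --wait:

dsub \
    ... \
    --preemptible \
    --retries 3 \
    --wait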

@sjgosai
Author

sjgosai commented Jul 29, 2019

It looks like I need to use --retries with non-preemptible VMs to get 100% completion in my test case. Thank you @wnojopra and @mbookman for your help. I've run it a few times with this setting, and it looks like 3/20 tasks (give or take) fail on the first attempt for some reason not reflected in the log files. It seems random which of the 20 tasks fails to transfer on the first attempt. Is a 15% delocalization failure rate in the expected range, or is there any troubleshooting I can do to get that percentage down?

@wnojopra
Contributor

I would expect this specific error code 10 to only happen on preemptible machines. If you're using a non-preemptible machine, I would expect your error rate to be extremely low. If you're seeing a high rate of transient errors with non-preemptible machines, please do share your logs with us. Also keep in mind that the issue isn't specific to delocalization: preemptible machines can be preempted at any point of the run (though most preemptions happen near the start).

@sjgosai
Author

sjgosai commented Jul 29, 2019

Here is some info on my last test. I used non-preemptible VMs, but I did use --retries. I had three tasks fail, then complete on the rerun. FYI: my script has a progress bar that doesn't print nicely to the log files. Here are the head and tail of the log from one of the failures:

2019-07-29 21:32:21 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/output gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1.log
2019-07-29 21:32:22 INFO: mkdir -m 777 -p /mnt/data/input/gs/haddath/sgosai/hff/data
2019-07-29 21:32:22 INFO: mkdir -m 777 -p /mnt/data/output/gs/haddath/sgosai/hff/data
2019-07-29 21:32:23 INFO: Localizing INFILE
2019-07-29 21:32:23 INFO: gsutil   -mq cp gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt /mnt/data/input/gs/haddath/sgosai/hff/data/FADS1_rep8detailed.txt
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
2019-07-29 21:33:22 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/stderr gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1-stderr.log
2019-07-29 21:33:22 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/output gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1.log
2019-07-29 21:33:22 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/stdout gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1-stdout.log
Multiprocess sampling (8 chains in 8 jobs)
NUTS: [enhancer_activity, guide_intensity]
Starting wnd_771

Sampling 8 chains:   0%|          | 0/16000 [00:00<?, ?draws/s]
Sampling 8 chains:   1%|          | 173/16000 [00:00<00:09, 1726.86draws/s]
Sampling 8 chains:   2%|▏         | 327/16000 [00:00<00:09, 1665.76draws/s]
...
Sampling 8 chains:  94%|█████████▍| 15111/16000 [00:06<00:00, 2841.80draws/s]
Sampling 8 chains:  96%|█████████▌| 15399/16000 [00:06<00:00, 2267.83draws/s]
Sampling 8 chains:  98%|█████████▊| 15647/16000 [00:06<00:00, 2076.39draws/s]
Sampling 8 chains:  99%|█████████▉| 15873/16000 [00:06<00:00, 1736.07draws/s]
Sampling 8 chains: 100%|██████████| 16000/16000 [00:06<00:00, 2378.76draws/s]
Done.

And then on the retry:

2019-07-29 22:08:44 INFO: mkdir -m 777 -p /mnt/data/input/gs/haddath/sgosai/hff/data
2019-07-29 22:08:44 INFO: mkdir -m 777 -p /mnt/data/output/gs/haddath/sgosai/hff/data
2019-07-29 22:08:45 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/output gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.2.log
2019-07-29 22:08:45 INFO: Localizing INFILE
2019-07-29 22:08:45 INFO: gsutil   -mq cp gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt /mnt/data/input/gs/haddath/sgosai/hff/data/FADS1_rep8detailed.txt
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
2019-07-29 22:09:46 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/stdout gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.2-stdout.log
2019-07-29 22:09:46 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/stderr gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.2-stderr.log
2019-07-29 22:09:46 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/output gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.2.log
Multiprocess sampling (8 chains in 8 jobs)
NUTS: [enhancer_activity, guide_intensity]
Starting wnd_771

Sampling 8 chains:   0%|          | 0/16000 [00:00<?, ?draws/s]
Sampling 8 chains:   1%|          | 163/16000 [00:00<00:09, 1621.95draws/s]
Sampling 8 chains:   2%|▏         | 321/16000 [00:00<00:09, 1605.92draws/s]
...
Sampling 8 chains:  99%|█████████▉| 15818/16000 [00:07<00:00, 847.31draws/s] 
Sampling 8 chains:  99%|█████████▉| 15915/16000 [00:08<00:00, 755.30draws/s]
Sampling 8 chains: 100%|██████████| 16000/16000 [00:08<00:00, 1943.18draws/s]
Done.
2019-07-29 22:22:53 INFO: Delocalizing OUTFILE
2019-07-29 22:22:53 INFO: gsutil   -mq cp /mnt/data/output/gs/haddath/sgosai/hff/data/FADS1_rep8__14_20.bed gs://haddath/sgosai/hff/data/FADS1_rep8__14_20.bed

The log from the first attempt simply stops after "Done" (printing "Done" to stderr is the absolute last action of the script).

@sjgosai
Author

sjgosai commented Jul 29, 2019

Here are the dstat outputs for the two attempts.
First:

- create-time: '2019-07-29 17:30:25.995427'
  dsub-version: v0-3-2
  end-time: '2019-07-29 18:06:28.476723'
  envs:
    CHUNK: '14'
  events:
  - name: start
    start-time: 2019-07-29 21:30:29.451587+00:00
  - name: pulling-image
    start-time: 2019-07-29 21:31:19.931151+00:00
  - name: localizing-files
    start-time: 2019-07-29 21:32:23.005887+00:00
  - name: running-docker
    start-time: 2019-07-29 21:32:25.404431+00:00
  - name: The assigned worker has failed to complete the operation
    start-time: 2019-07-29 22:06:27.745789+00:00
  input-recursives: {}
  inputs:
    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
  internal-id: projects/sabeti-encode/operations/16463097815993481630
  job-id: python--sagergosai--190729-173020-29
  job-name: python
  labels: {}
  last-update: '2019-07-29 18:06:28.476723'
  logging: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1.log
  mounts: {}
  output-recursives: {}
  outputs:
    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__14_20.bed
  provider: google-v2
  provider-attributes:
    accelerators: []
    boot-disk-size: 250
    cpu_platform: ''
    disk-size: 200
    disk-type: pd-standard
    enable-stackdriver-monitoring: false
    instance-name: google-pipelines-worker-3facb0b2a5a1721eb947a95f682395e1
    machine-type: n1-standard-8
    network: ''
    preemptible: false
    regions: []
    service-account: default
    subnetwork: ''
    use_private_address: false
    zone: us-east4-b
    zones:
    - us-central1-a
    - us-central1-b
    - us-central1-c
    - us-central1-f
    - us-east1-b
    - us-east1-c
    - us-east1-d
    - us-east4-a
    - us-east4-b
    - us-east4-c
    - us-west1-a
    - us-west1-b
    - us-west1-c
    - us-west2-a
    - us-west2-b
    - us-west2-c
  script: |-
    #!/usr/bin/env bash
    python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100
  script-name: python
  start-time: '2019-07-29 17:30:29.451587'
  status: FAILURE
  status-detail: The assigned worker has failed to complete the operation
  status-message: The assigned worker has failed to complete the operation
  task-attempt: 1
  task-id: '15'
  user-id: sagergosai

Second:

- create-time: '2019-07-29 18:06:38.730417'
  dsub-version: v0-3-2
  end-time: '2019-07-29 18:22:59.300195'
  envs:
    CHUNK: '14'
  events:
  - name: start
    start-time: 2019-07-29 22:06:39.752933+00:00
  - name: pulling-image
    start-time: 2019-07-29 22:07:29.611551+00:00
  - name: localizing-files
    start-time: 2019-07-29 22:08:44.990479+00:00
  - name: running-docker
    start-time: 2019-07-29 22:08:47.492463+00:00
  - name: delocalizing-files
    start-time: 2019-07-29 22:22:53.713359+00:00
  - name: ok
    start-time: 2019-07-29 22:22:59.300195+00:00
  input-recursives: {}
  inputs:
    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
  internal-id: projects/sabeti-encode/operations/17904354467014874191
  job-id: python--sagergosai--190729-173020-29
  job-name: python
  labels: {}
  last-update: '2019-07-29 18:22:59.300195'
  logging: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.2.log
  mounts: {}
  output-recursives: {}
  outputs:
    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__14_20.bed
  provider: google-v2
  provider-attributes:
    accelerators: []
    boot-disk-size: 250
    cpu_platform: ''
    disk-size: 200
    disk-type: pd-standard
    enable-stackdriver-monitoring: false
    instance-name: google-pipelines-worker-1450ce09faa50ad26f1aed67591faf26
    machine-type: n1-standard-8
    network: ''
    preemptible: false
    regions: []
    service-account: default
    subnetwork: ''
    use_private_address: false
    zone: us-west2-b
    zones:
    - us-central1-a
    - us-central1-b
    - us-central1-c
    - us-central1-f
    - us-east1-b
    - us-east1-c
    - us-east1-d
    - us-east4-a
    - us-east4-b
    - us-east4-c
    - us-west1-a
    - us-west1-b
    - us-west1-c
    - us-west2-a
    - us-west2-b
    - us-west2-c
  script: |-
    #!/usr/bin/env bash
    python /app/hcr-ff/call_peaks.py ${INFILE} ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100
  script-name: python
  start-time: '2019-07-29 18:06:39.752933'
  status: SUCCESS
  status-detail: Success
  status-message: Success
  task-attempt: 2
  task-id: '15'
  user-id: sagergosai

@sjgosai
Copy link
Author

sjgosai commented Jul 29, 2019

Finally, here's the gcloud alpha genomics operations describe <id> output for the failed attempt.

done: true
error:
  code: 10
  message: The assigned worker has failed to complete the operation
metadata:
  '@type': type.googleapis.com/google.genomics.v2alpha1.Metadata
  createTime: '2019-07-29T21:30:25.995427Z'
  endTime: '2019-07-29T22:06:28.476723Z'
  events:
  - description: Worker released
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.WorkerReleasedEvent
      instance: google-pipelines-worker-3facb0b2a5a1721eb947a95f682395e1
      zone: us-east4-b
    timestamp: '2019-07-29T22:06:28.476723Z'
  - description: The assigned worker has failed to complete the operation
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.FailedEvent
      cause: The assigned worker has failed to complete the operation
      code: ABORTED
    timestamp: '2019-07-29T22:06:27.745789Z'
  - description: Started running "user-command"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 4
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-29T21:32:25.404431Z'
  - description: Stopped running "localization"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStoppedEvent
      actionId: 3
      exitStatus: 0
      stderr: ''
    timestamp: '2019-07-29T21:32:25.266943Z'
  - description: Started running "localization"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 3
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-29T21:32:23.005887Z'
  - description: Stopped running "prepare"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStoppedEvent
      actionId: 2
      exitStatus: 0
      stderr: ''
    timestamp: '2019-07-29T21:32:22.866079Z'
  - description: Started running "prepare"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 2
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-29T21:32:20.818175Z'
  - description: Started running "logging"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.ContainerStartedEvent
      actionId: 1
      ipAddress: ''
      portMappings: {}
    timestamp: '2019-07-29T21:32:20.656207Z'
  - description: Stopped pulling "sjgosai/hff-kit:0.1.3"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStoppedEvent
      imageUri: sjgosai/hff-kit:0.1.3
    timestamp: '2019-07-29T21:32:09.861807Z'
  - description: Started pulling "sjgosai/hff-kit:0.1.3"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent
      imageUri: sjgosai/hff-kit:0.1.3
    timestamp: '2019-07-29T21:31:19.931151Z'
  - description: Stopped pulling "python:2.7-slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStoppedEvent
      imageUri: python:2.7-slim
    timestamp: '2019-07-29T21:31:19.790479Z'
  - description: Started pulling "python:2.7-slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent
      imageUri: python:2.7-slim
    timestamp: '2019-07-29T21:31:17.115855Z'
  - description: Stopped pulling "google/cloud-sdk:slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStoppedEvent
      imageUri: google/cloud-sdk:slim
    timestamp: '2019-07-29T21:31:16.981039Z'
  - description: Started pulling "google/cloud-sdk:slim"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.PullStartedEvent
      imageUri: google/cloud-sdk:slim
    timestamp: '2019-07-29T21:31:04.133055Z'
  - description: Worker "google-pipelines-worker-3facb0b2a5a1721eb947a95f682395e1"
      assigned in "us-east4-b"
    details:
      '@type': type.googleapis.com/google.genomics.v2alpha1.WorkerAssignedEvent
      instance: google-pipelines-worker-3facb0b2a5a1721eb947a95f682395e1
      machineType: n1-standard-8
      zone: us-east4-b
    timestamp: '2019-07-29T21:30:29.451587Z'
  labels:
    dsub-version: v0-3-2
    job-id: python--sagergosai--190729-173020-29
    job-name: python
    task-attempt: '1'
    task-id: '15'
    user-id: sagergosai
  pipeline:
    actions:
    - commands:
      - -c
      - |
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        while [[ ! -d /google/logs/action/6 ]]; do
          mkdir -p /tmp/continuous_logging_action
        log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
        STDOUT_PID=$!
        log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/continuous_logging_action "true" "${USER_PROJECT}" &
        STDERR_PID=$!

        log_cp /google/logs/output "${LOGGING_PATH}" /tmp/continuous_logging_action "false" "${USER_PROJECT}" &
        LOG_PID=$!

        wait "${STDOUT_PID}"
        wait "${STDERR_PID}"
        wait "${LOG_PID}"


          sleep 60s
        done
      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1.log
        STDERR_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1-stderr.log
        STDOUT_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1-stdout.log
        USER_PROJECT: ''
      flags:
      - RUN_IN_BACKGROUND
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts: []
      name: logging
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        #!/bin/bash

        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        mkdir -m 777 -p "/mnt/data/script" 
        mkdir -m 777 -p "/mnt/data/tmp" 
        mkdir -m 777 -p "/mnt/data/workingdir" 

        echo "${_SCRIPT_REPR}"     | python -c 'import ast
        import sys

        sys.stdout.write(ast.literal_eval(sys.stdin.read()))
        '     > /mnt/data/script/python
        chmod a+x /mnt/data/script/python

        for ((i=0; i < DIR_COUNT; i++)); do
          DIR_VAR="DIR_${i}"

          log_info "mkdir -m 777 -p "${!DIR_VAR}""
          mkdir -m 777 -p "${!DIR_VAR}"
        done

      entrypoint: /bin/bash
      environment:
        DIR_0: /mnt/data/input/gs/haddath/sgosai/hff/data
        DIR_1: /mnt/data/output/gs/haddath/sgosai/hff/data
        DIR_COUNT: '2'
        _META_YAML_REPR: "\"create-time: 2019-07-29 17:30:20.299204-04:00\\ndsub-version:\
          \ v0-3-2\\njob-id: python--sagergosai--190729-173020-29\\njob-name: python\\\
          nlogging: gs://haddath/sgosai/hff/logs/\\nscript-name: python\\ntask-ids:\
          \ 1-20\\ntasks:\\n- envs:\\n    CHUNK: '14'\\n  inputs:\\n    INFILE: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt\\\
          n  logging-path: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1.log\\\
          n  outputs:\\n    OUTFILE: gs://haddath/sgosai/hff/data/FADS1_rep8__14_20.bed\\\
          n  task-attempt: 1\\n  task-id: '15'\\nuser-id: sagergosai\\n\""
        _SCRIPT_REPR: "'#!/usr/bin/env bash\\npython /app/hcr-ff/call_peaks.py ${INFILE}\
          \ ${OUTFILE} -ji ${CHUNK} -jr 20 -ws 100 -ss 100'"
      flags: []
      imageUri: python:2.7-slim
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: prepare
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < INPUT_COUNT; i++)); do
          INPUT_VAR="INPUT_${i}"
          INPUT_RECURSIVE="INPUT_RECURSIVE_${i}"
          INPUT_SRC="INPUT_SRC_${i}"
          INPUT_DST="INPUT_DST_${i}"

          log_info "Localizing ${!INPUT_VAR}"
          if [[ "${!INPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!INPUT_SRC}" "${!INPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!INPUT_SRC}" "${!INPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      entrypoint: /bin/bash
      environment:
        INPUT_0: INFILE
        INPUT_COUNT: '1'
        INPUT_DST_0: /mnt/data/input/gs/haddath/sgosai/hff/data/FADS1_rep8detailed.txt
        INPUT_RECURSIVE_0: '0'
        INPUT_SRC_0: gs://haddath/sgosai/hff/data/FADS1_rep8detailed.txt
        USER_PROJECT: ''
      flags: []
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: localization
      pidNamespace: ''
      portMappings: {}
    - commands:
      - bash
      - -c
      - |
        export TMPDIR="/mnt/data/tmp"
        cd /mnt/data/workingdir

        /mnt/data/script/python
      entrypoint: /usr/bin/env
      environment:
        CHUNK: '14'
        INFILE: /mnt/data/input/gs/haddath/sgosai/hff/data/FADS1_rep8detailed.txt
        OUTFILE: /mnt/data/output/gs/haddath/sgosai/hff/data/FADS1_rep8__14_20.bed
      flags: []
      imageUri: sjgosai/hff-kit:0.1.3
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: user-command
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_rsync() {
          local src="${1}"
          local dst="${2}"
          local user_project_name="${3}"

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
            if gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${user_project_flag} -mq rsync -r "${src}" "${dst}""
          exit 1
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }


        set -o errexit
        set -o nounset
        set -o pipefail

        for ((i=0; i < OUTPUT_COUNT; i++)); do
          OUTPUT_VAR="OUTPUT_${i}"
          OUTPUT_RECURSIVE="OUTPUT_RECURSIVE_${i}"
          OUTPUT_SRC="OUTPUT_SRC_${i}"
          OUTPUT_DST="OUTPUT_DST_${i}"

          log_info "Delocalizing ${!OUTPUT_VAR}"
          if [[ "${!OUTPUT_RECURSIVE}" -eq "1" ]]; then
            gsutil_rsync "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "${USER_PROJECT}"
          else
            gsutil_cp "${!OUTPUT_SRC}" "${!OUTPUT_DST}" "" "${USER_PROJECT}"
          fi
        done

      entrypoint: /bin/bash
      environment:
        OUTPUT_0: OUTFILE
        OUTPUT_COUNT: '1'
        OUTPUT_DST_0: gs://haddath/sgosai/hff/data/FADS1_rep8__14_20.bed
        OUTPUT_RECURSIVE_0: '0'
        OUTPUT_SRC_0: /mnt/data/output/gs/haddath/sgosai/hff/data/FADS1_rep8__14_20.bed
        USER_PROJECT: ''
      flags: []
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts:
      - disk: datadisk
        path: /mnt/data
        readOnly: false
      name: delocalization
      pidNamespace: ''
      portMappings: {}
    - commands:
      - -c
      - |+
        set -o errexit
        set -o nounset
        set -o pipefail

        function get_datestamp() {
          date "+%Y-%m-%d %H:%M:%S"
        }

        function log_info() {
          echo "$(get_datestamp) INFO: $@"
        }

        function log_error() {
          echo "$(get_datestamp) ERROR: $@"
        }

        function gsutil_cp() {
          local src="${1}"
          local dst="${2}"
          local content_type="${3}"
          local user_project_name="${4}"

          local headers=""
          if [[ -n "${content_type}" ]]; then
            headers="-h Content-Type:${content_type}"
          fi

          local user_project_flag=""
          if [[ -n "${user_project_name}" ]]; then
            user_project_flag="-u ${user_project_name}"
          fi

          local i
          for ((i = 0; i < 3; i++)); do
            log_info "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
            if gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}"; then
              return
            fi
          done

          2>&1 log_error "gsutil ${headers} ${user_project_flag} -mq cp "${src}" "${dst}""
          exit 1
        }

        function log_cp() {
          local src="${1}"
          local dst="${2}"
          local tmp="${3}"
          local check_src="${4}"
          local user_project_name="${5}"

          if [[ "${check_src}" == "true" ]] && [[ ! -e "${src}" ]]; then
            return
          fi

          # Copy the log files to a local temporary location so that our "gsutil cp" is never
          # executed on a file that is changing.

          local tmp_path="${tmp}/$(basename ${src})"
          cp "${src}" "${tmp_path}"

          gsutil_cp "${tmp_path}" "${dst}" "text/plain" "${user_project_name}"
        }

        mkdir -p /tmp/logging_action
        log_cp /google/logs/action/4/stdout "${STDOUT_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDOUT_PID=$!
        log_cp /google/logs/action/4/stderr "${STDERR_PATH}" /tmp/logging_action "true" "${USER_PROJECT}" &
        STDERR_PID=$!

        log_cp /google/logs/output "${LOGGING_PATH}" /tmp/logging_action "false" "${USER_PROJECT}" &
        LOG_PID=$!

        wait "${STDOUT_PID}"
        wait "${STDERR_PID}"
        wait "${LOG_PID}"

      entrypoint: /bin/bash
      environment:
        LOGGING_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1.log
        STDERR_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1-stderr.log
        STDOUT_PATH: gs://haddath/sgosai/hff/logs/python--sagergosai--190729-173020-29.15.1-stdout.log
        USER_PROJECT: ''
      flags:
      - ALWAYS_RUN
      imageUri: google/cloud-sdk:slim
      labels: {}
      mounts: []
      name: final_logging
      pidNamespace: ''
      portMappings: {}
    environment: {}
    resources:
      projectId: sabeti-encode
      regions: []
      virtualMachine:
        accelerators: []
        bootDiskSizeGb: 250
        bootImage: projects/cos-cloud/global/images/family/cos-stable
        cpuPlatform: ''
        disks:
        - name: datadisk
          sizeGb: 200
          sourceImage: ''
          type: pd-standard
        enableStackdriverMonitoring: false
        labels:
          dsub-version: v0-3-2
          goog-pipelines-worker: 'true'
          job-id: python--sagergosai--190729-173020-29
          job-name: python
          task-attempt: '1'
          task-id: '15'
          user-id: sagergosai
        machineType: n1-standard-8
        network:
          name: ''
          subnetwork: ''
          usePrivateAddress: false
        nvidiaDriverVersion: ''
        preemptible: false
        serviceAccount:
          email: default
          scopes:
          - https://www.googleapis.com/auth/bigquery
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.full_control
          - https://www.googleapis.com/auth/genomics
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring.write
      zones:
      - us-central1-a
      - us-central1-b
      - us-central1-c
      - us-central1-f
      - us-east1-b
      - us-east1-c
      - us-east1-d
      - us-east4-a
      - us-east4-b
      - us-east4-c
      - us-west1-a
      - us-west1-b
      - us-west1-c
      - us-west2-a
      - us-west2-b
      - us-west2-c
    timeout: 14400s
  startTime: '2019-07-29T21:30:29.451587Z'
name: projects/sabeti-encode/operations/16463097815993481630

@mbookman
Contributor

Hi @sjgosai!

For this last example, there is no indication that delocalization ever started. It looks like the "Worker" process on the node failed to check in prior to delocalization. Even though your user-command did finish, it is very possible that you were right on the edge of running out of memory and that the OOM killer killed the Pipelines API worker.

Our first recommendation here is to increase the amount of memory available on the VM. See if that makes a difference in the success-without-retry rate.
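
Concretely, that just means a roomier machine type at submission time, e.g.:

dsub \
    ... \
    --machine-type n1-highmem-8

(or --min-ram, which lets dsub pick a custom machine type with at least that much memory).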

@sjgosai
Author

sjgosai commented Jul 30, 2019

Thanks @mbookman, it seems like that's working.

I'm just a bit surprised, because I had originally tested my code with my test data on an n1-highcpu-8 VM and had assumed an n1-standard-8 would give me enough wiggle room. The n1-highmem-8 worked perfectly, though, so it must have been memory related. I'm going to test a few more times for completeness, but I think this probably fixed it.

Thanks for your help!

Edit:
I lied: I didn't test on an n1-highcpu-8; it was a custom machine type with 10 cores and 20 GB of RAM.
