
[enhancement] Running a Prefect flow results in pod evictions and potentially a bad state #707

Closed
iameskild opened this issue Jun 29, 2021 · 8 comments
Labels: type: enhancement 💅🏼 New feature or request

@iameskild
Member

Describe the bug


We have a prefect-agent pod running and have registered a few flows with it. While working to get these flows running as expected (we are currently also experiencing dask-gateway issues), we have been running into problems with the cluster. As part of the Prefect flow testing, I manually start a flow run from the Prefect cloud console, and this action results in many critical pods being evicted and an additional general node being created, splitting the qhub cluster pods between the two general nodes.

Here is the most accurate timeline I have documented so far:

  1. Kick off Prefect flow run from the Prefect cloud console

    • This flow was registered with the Prefect agent using a GitHub Action workflow
    • The prefect flow image is stored in ECR and is quite large at around ~9 GB
  2. A prefect-job pod spins up on the general node and, after a few minutes, the pod fails and falls into a CrashLoopBackOff state with the following event messages:

    Type     Reason     Age                  From                                                Message
    ----     ------     ----                 ----                                                -------
    Normal   Scheduled  9m2s                 default-scheduler                                   Successfully assigned dev/prefect-job-beae9939-znksf to ip-10-10-19-99.us-east-2.compute.internal
    Warning  Failed     6m32s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/kubernetes-server-1.21.2-h77c71de_0.tar.bz2: no space left on device
    Warning  Failed     3m54s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libclang-11.1.0-default_ha53f305_1.tar.bz2: no space left on device
    Warning  Failed     53s                  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libgdal-3.2.1-h38ff51b_7/lib/libgdal.so.28.0.1: no space left on device
    Warning  Failed     53s (x3 over 6m32s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Error: ErrImagePull
    Normal   BackOff    25s (x4 over 6m30s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Back-off pulling image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792"
    
  3. At this point, many of the other pods on the general node start being quickly evicted

    • These include the user-scheduler, qhub-traefik-ingress, qhub-jupyterhub-ssh and proxy pods
    • There appeared to be a dozen or so evicted user-scheduler pods
    • Most of these pods have similar event log messages:
    Type     Reason               Age    From                                                Message
    ----     ------               ----   ----                                                -------
    Warning  Evicted              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node was low on resource: ephemeral-storage. Container user-scheduler was using 33228Ki, which exceeds its request of 0.
    Normal   Killing              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Stopping container user-scheduler
    Warning  ExceededGracePeriod  8m30s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Container runtime did not kill the pod within specified grace period.
    
    • Some of the user-scheduler pods have this event message:
    Type     Reason     Age   From                                                Message
    ----     ------     ----  ----                                                -------
    Normal   Scheduled  16m   default-scheduler                                   Successfully assigned dev/user-scheduler-57959bddf7-m9b6x to ip-10-10-19-99.us-east-2.compute.internal
    Warning  Evicted    16m   kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node had condition: [DiskPressure].
    
  4. Then the conda-store and hub pods get evicted and have trouble coming back online.

    • conda-store and hub pod event messages:
    Type     Reason             Age                 From                Message
    ----     ------             ----                ----                -------
    Normal   NotTriggerScaleUp  79s (x52 over 11m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 max node group size reached
    Warning  FailedScheduling   29s (x11 over 11m)  default-scheduler   0/3 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) had volume node affinity conflict.
    

After documenting all of the above, I waited a few minutes to see whether the cluster would gracefully correct itself, but ultimately I had to manually delete all of the evicted pods as well as the prefect-job job. To get back down to one general node, I drained the new general node and force-deleted any pods that refused to be evicted. This resulted in a stable cluster.

After a bit of digging around and reading online, our hypothesis is that the storage attached to the node was overwhelmed by the size of the prefect-job image, and the kubelet was forced to evict many of the other pods to relieve the disk pressure. There are currently three block-store volumes attached to the general node:

  • 100 GB for the conda-store
  • 20 GB (unknown purpose)
  • 1 GB (unknown purpose)

The next step for us is to reduce the size of the prefect-job image; however, we were also wondering whether we could (or should) increase the size of the block stores attached to the general node.

How can we help?

Help us help you.

  • What are you trying to achieve?
    • Register a flow with the Prefect agent and have a prefect-job spin up without causing trouble for the cluster
  • How can we reproduce the problem?
    • Yes, I'd be happy to show how I got to these logs.

Your environment

Describe the environment in which you are experiencing the bug.

Include your conda version (use conda --version), k8s and any other relevant details.

@brl0
Contributor

brl0 commented Jun 29, 2021

So far I haven't found a particularly easy way to add storage to the nodes.

From what I have read, it looks like the way to do this through Terraform would be through launch templates; here is what seems to be the relevant Terraform resource, which has a volume_size attribute.
The launch template can then be referenced in the node group.

Here is an example of a launch template I found while digging into this.
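Roughly, the Terraform shape I have in mind would be something like the sketch below. This is only an illustration of the idea under my assumptions: the resource names, the 100 GB size, and the referenced cluster/role/subnet values are placeholders, not anything qhub defines today.

    # Rough sketch (all names and values are placeholders): a launch template that
    # requests a larger root EBS volume, referenced from the managed node group.
    resource "aws_launch_template" "general" {
      name_prefix = "qhub-general-"

      block_device_mappings {
        device_name = "/dev/xvda" # root device on the Amazon Linux 2 EKS AMIs

        ebs {
          volume_size = 100 # GiB of node-local storage (placeholder value)
          volume_type = "gp2"
        }
      }
    }

    resource "aws_eks_node_group" "general" {
      cluster_name    = aws_eks_cluster.this.name # placeholder reference
      node_group_name = "general"
      node_role_arn   = aws_iam_role.node.arn     # placeholder reference
      subnet_ids      = var.subnet_ids            # placeholder reference

      scaling_config {
        desired_size = 1
        max_size     = 1
        min_size     = 1
      }

      launch_template {
        id      = aws_launch_template.general.id
        version = aws_launch_template.general.latest_version
      }
    }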

I am not sure about the best way to cleanly integrate something like this with qhub. One thought is to create a single default launch template for all node groups, which could probably be easily overridden with a new section in the qhub-config.yaml. Another, more comprehensive approach would be to automatically create a launch template for each node group, with appropriate typical defaults, while allowing overrides through additional new fields in the already defined section of the qhub config.

While I am not particularly pleased with the additional complexity this idea might add, I do think it would probably add a fair bit of flexibility, at least as far as AWS is concerned.

@aktech
Member

aktech commented Jul 1, 2021

@iameskild Thanks a lot for the very descriptive issue, this is really helpful.

@brl0 Thanks for the analysis on this. I'll take a look at it today; my hunch is that exposing disk_size for the EKS node group resource via the qhub config file should fix this problem.
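Something along these lines is what I mean (again only a rough sketch; the resource names and referenced values are placeholders); the managed node group resource can set the root volume size directly, without a launch template:

    # Rough sketch (placeholder names/values): set the node root volume size via
    # disk_size (in GiB) on the managed node group; the EKS default is 20 GiB.
    # Note that disk_size cannot be combined with a custom launch_template block.
    resource "aws_eks_node_group" "general" {
      cluster_name    = aws_eks_cluster.this.name # placeholder reference
      node_group_name = "general"
      node_role_arn   = aws_iam_role.node.arn     # placeholder reference
      subnet_ids      = var.subnet_ids            # placeholder reference
      disk_size       = 100                       # placeholder value

      scaling_config {
        desired_size = 1
        max_size     = 1
        min_size     = 1
      }
    }

The 100 here would just be whatever value gets exposed through the qhub config file.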

@brl0
Contributor

brl0 commented Jul 1, 2021

Funny, I missed that; looks pretty straightforward. Thanks!

@aktech
Member

aktech commented Jul 1, 2021

Here is an example of implementing something like that: #604

@iameskild iameskild changed the title [bug] Running a Prefect flow results in pod evictions and potentially a bad state [enhancement] Running a Prefect flow results in pod evictions and potentially a bad state Jul 2, 2021
@tarundmsharma tarundmsharma added the type: enhancement 💅🏼 New feature or request label Jul 27, 2021
@viniciusdc
Contributor

viniciusdc commented Jul 27, 2021

As @aktech commented on #719, per dask-gateway:409 discussions, we're waiting for the next release of dask-gateway to close this.

@brl0
Contributor

brl0 commented Jul 27, 2021

Actually, this issue is different and was due to disk space on the nodes not being large enough, which looks like it was fixed today in #740, so this should be ok to close.

@viniciusdc
Contributor

Actually, this issue is different and was due to disk space on the nodes not being large enough, which looks like it was fixed today in #740, so this should be ok to close.

Oh, thanks @brl0. @iameskild, have you encountered this issue again since the update from #740?

@iameskild
Member Author

@viniciusdc we have redeployed a few times and, today, I ran several Prefect flows without any issue. Thank you for following up!
Closing this issue.
