
[enhancement] Running a Prefect flow results in pod evictions and potentially a bad state #707

Closed
iameskild opened this issue Jun 29, 2021 · 8 comments
Labels: type: enhancement 💅🏼 New feature or request

@iameskild
Member

Describe the bug


We have a prefect-agent pod running and have registered a few flows with it. While working to get these flows running as expected (we are currently also experiencing dask-gateway issues), we have been running into problems with the cluster. As part of the Prefect flow testing, I manually start a flow run from the Prefect cloud console, and this action results in many critical pods being evicted and an additional general node being created, splitting the qhub cluster pods between the two general nodes.

Here is the most accurate timeline I have documented so far:

  1. Kick off Prefect flow run from the Prefect cloud console

    • This flow was registered with the Prefect agent using a GitHub Action workflow
    • The prefect flow image is stored in ECR and is quite large at around ~9 GB
  2. A prefect-job pod spins up on the general node and, after a few minutes, the pod fails and falls into a CrashLoopBackOff state with the following event messages:

    Type     Reason     Age                  From                                                Message
    ----     ------     ----                 ----                                                -------
    Normal   Scheduled  9m2s                 default-scheduler                                   Successfully assigned dev/prefect-job-beae9939-znksf to ip-10-10-19-99.us-east-2.compute.internal
    Warning  Failed     6m32s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/kubernetes-server-1.21.2-h77c71de_0.tar.bz2: no space left on device
    Warning  Failed     3m54s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libclang-11.1.0-default_ha53f305_1.tar.bz2: no space left on device
    Warning  Failed     53s                  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libgdal-3.2.1-h38ff51b_7/lib/libgdal.so.28.0.1: no space left on device
    Warning  Failed     53s (x3 over 6m32s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Error: ErrImagePull
    Normal   BackOff    25s (x4 over 6m30s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Back-off pulling image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792"
    
  3. At this point, many of the other pods on the general node start being quickly evicted

    • These include the user-scheduler, qhub-traefik-ingress, qhub-jupyterhub-ssh and proxy pods
    • There appeared to be a dozen or so evicted user-scheduler pods
    • Most of these pods have similar event log messages:
    Type     Reason               Age    From                                                Message
    ----     ------               ----   ----                                                -------
    Warning  Evicted              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node was low on resource: ephemeral-storage. Container user-scheduler was using 33228Ki, which exceeds its request of 0.
    Normal   Killing              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Stopping container user-scheduler
    Warning  ExceededGracePeriod  8m30s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Container runtime did not kill the pod within specified grace period.
    
    • Some of the user-scheduler pods have this event message:
    Type     Reason     Age   From                                                Message
    ----     ------     ----  ----                                                -------
    Normal   Scheduled  16m   default-scheduler                                   Successfully assigned dev/user-scheduler-57959bddf7-m9b6x to ip-10-10-19-99.us-east-2.compute.internal
    Warning  Evicted    16m   kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node had condition: [DiskPressure].
    
  4. Then the conda-store and hub pods get evicted and have trouble coming back online.

    • conda-store and hub pod event messages:
    Type     Reason             Age                 From                Message
    ----     ------             ----                ----                -------
    Normal   NotTriggerScaleUp  79s (x52 over 11m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 max node group size reached
    Warning  FailedScheduling   29s (x11 over 11m)  default-scheduler   0/3 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) had volume node affinity conflict.
    

After documenting all of the above, I waited a few minutes to see whether the cluster would gracefully correct itself, but ultimately I had to manually delete all of the evicted pods as well as the prefect-job job. To get back down to one general node, I drained the new general node and force-deleted any pods that refused to be evicted. This resulted in a stable cluster.

After a bit of digging around and reading online, our hypothesis is that the storage attached to the node was overwhelmed by the size of the prefect-job image, and the kubelet was forced to evict many of the other pods to relieve the disk pressure. There are currently three block-store volumes attached to the general node:

  • 100 GB for the conda-store
  • 20 GB (unknown purpose)
  • 1 GB (unknown purpose)

The next step for us is to reduce the size of the prefect-job image; however, we were also wondering whether we could (or should) increase the size of the block stores attached to the general node.

How can we help?

Help us help you.

  • What are you trying to achieve?
    • Register a flow with the Prefect agent and have a prefect-job spin up without causing trouble for the cluster
  • How can we reproduce the problem?
    • Yes, I'd be happy to show how I got to these logs.

Your environment

Describe the environment in which you are experiencing the bug.

Include your conda version (use conda --version), k8s and any other relevant details.

@brl0
Contributor

brl0 commented Jun 29, 2021

So far I haven't found a particularly easy way to add storage to the nodes.

From what I have read, it looks like the way to do this through Terraform would be through launch templates; here is what seems to be the relevant Terraform resource, which has a volume_size attribute.
The launch template can then be referenced in the node group.

Here is an example of a launch template I found while digging into this.
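Roughly, the Terraform shape I have in mind would be something like the sketch below. This is only an illustration of the idea under my assumptions: the resource names, the 100 GB size, and the referenced cluster/role/subnet values are placeholders, not anything qhub defines today.

    # Rough sketch (all names and values are placeholders): a launch template that
    # requests a larger root EBS volume, referenced from the managed node group.
    resource "aws_launch_template" "general" {
      name_prefix = "qhub-general-"

      block_device_mappings {
        device_name = "/dev/xvda" # root device on the Amazon Linux 2 EKS AMIs

        ebs {
          volume_size = 100 # GiB of node-local storage (placeholder value)
          volume_type = "gp2"
        }
      }
    }

    resource "aws_eks_node_group" "general" {
      cluster_name    = aws_eks_cluster.this.name # placeholder reference
      node_group_name = "general"
      node_role_arn   = aws_iam_role.node.arn     # placeholder reference
      subnet_ids      = var.subnet_ids            # placeholder reference

      scaling_config {
        desired_size = 1
        max_size     = 1
        min_size     = 1
      }

      launch_template {
        id      = aws_launch_template.general.id
        version = aws_launch_template.general.latest_version
      }
    }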

I am not sure about the best way to cleanly integrate something like this with qhub. One thought is to create a single default launch template for all node groups, which could probably be easily overridden with a new section in the qhub-config.yaml. Another, more comprehensive approach would be to automatically create a launch template for each node group, with appropriate typical defaults, while allowing overrides through additional new fields in the already defined section of the qhub config.

While I am not particularly pleased with the additional complexity this idea might add, I do think it would probably add a fair bit of flexibility, at least as far as AWS is concerned.

@aktech
Member

aktech commented Jul 1, 2021

@iameskild Thanks a lot for the very descriptive issue, this is really helpful.

@brl0 Thanks for the analysis on this. I'll take a look at it today; my hunch is that exposing disk_size for the EKS node group resource via the qhub config file should fix this problem.
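Something along these lines is what I mean (again only a rough sketch; the resource names and referenced values are placeholders); the managed node group resource can set the root volume size directly, without a launch template:

    # Rough sketch (placeholder names/values): set the node root volume size via
    # disk_size (in GiB) on the managed node group; the EKS default is 20 GiB.
    # Note that disk_size cannot be combined with a custom launch_template block.
    resource "aws_eks_node_group" "general" {
      cluster_name    = aws_eks_cluster.this.name # placeholder reference
      node_group_name = "general"
      node_role_arn   = aws_iam_role.node.arn     # placeholder reference
      subnet_ids      = var.subnet_ids            # placeholder reference
      disk_size       = 100                       # placeholder value

      scaling_config {
        desired_size = 1
        max_size     = 1
        min_size     = 1
      }
    }

The 100 here would just be whatever value gets exposed through the qhub config file.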

@brl0
Contributor

brl0 commented Jul 1, 2021

Funny, I missed that; looks pretty straightforward. Thanks!

@aktech
Member

aktech commented Jul 1, 2021

Here is an example of implementing something like that: #604

@iameskild iameskild changed the title [bug] Running a Prefect flow results in pod evictions and potentially a bad state [enhancement] Running a Prefect flow results in pod evictions and potentially a bad state Jul 2, 2021
@tarundmsharma tarundmsharma added the type: enhancement 💅🏼 New feature or request label Jul 27, 2021
@viniciusdc
Contributor

viniciusdc commented Jul 27, 2021

As @aktech commented on #719, per dask-gateway:409 discussions, we're waiting for the next release of dask-gateway to close this.

@brl0
Contributor

brl0 commented Jul 27, 2021

Actually, this issue is different and was due to disk space on the nodes not being large enough, which looks like it was fixed today in #740, so this should be ok to close.

@viniciusdc
Contributor

Actually, this issue is different and was due to disk space on the nodes not being large enough, which looks like it was fixed today in #740, so this should be ok to close.

Oh, thanks @brl0. @iameskild, have you encountered this issue again since the update from #740?

@iameskild
Member Author

@viniciusdc we have redeployed a few times and, today, I ran several Prefect flows without any issue. Thank you for following up!
Closing this issue.
