[enhancement] Running a Prefect flow results in pod evictions and potentially a bad state #707
Comments
So far I haven't found a particularly easy way to add storage to the nodes. From what I have read, the way to do this through Terraform would be through launch templates; here is what seems to be the relevant terraform provider, which has a volume_size attribute, and here is an example of a launch template I found while digging into this. I am not sure about the best way to cleanly integrate something like this with qhub. One thought is to create a single default launch template for all node groups, which could probably be overridden with a new section in the qhub config file. While I am not particularly pleased with the additional complexity this idea might add, I do think it would add a fair bit of flexibility, at least as far as AWS is concerned.
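For illustration, here is a minimal Terraform sketch of that launch-template idea; the resource names, the `/dev/xvda` device name, and the 100 GiB size are assumptions for the example, not qhub's actual templates:

```hcl
# Hypothetical sketch: give EKS worker nodes a larger root EBS volume via a
# launch template, and attach that template to the managed node group.
resource "aws_launch_template" "general" {
  name_prefix = "qhub-general-"

  block_device_mappings {
    device_name = "/dev/xvda" # assumed root device name for the node AMI

    ebs {
      volume_size = 100 # GiB; illustrative value, larger than the default
      volume_type = "gp2"
    }
  }
}

resource "aws_eks_node_group" "general" {
  cluster_name    = "qhub-cluster"        # placeholder
  node_group_name = "general"
  node_role_arn   = aws_iam_role.node.arn # placeholder
  subnet_ids      = var.subnet_ids        # placeholder

  launch_template {
    id      = aws_launch_template.general.id
    version = aws_launch_template.general.latest_version
  }

  scaling_config {
    desired_size = 1
    max_size     = 5
    min_size     = 1
  }
}
```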
@iameskild Thanks a lot for the fully descriptive issue, this is really very helpful. @brl0 Thanks for the analysis on this. I'll take a look at it today; my hunch is that exposing the disk_size for the EKS node group resource via the qhub config file should fix this problem.
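As a rough sketch of that hunch (not the actual qhub implementation), the `aws_eks_node_group` resource already accepts a `disk_size` argument, so a single value from the qhub config file could be passed straight through; the names and the 100 GiB value below are placeholders:

```hcl
# Hypothetical sketch: set the root volume size directly on the managed node
# group. Note that disk_size cannot be combined with a custom launch template.
resource "aws_eks_node_group" "general" {
  cluster_name    = "qhub-cluster"        # placeholder
  node_group_name = "general"
  node_role_arn   = aws_iam_role.node.arn # placeholder
  subnet_ids      = var.subnet_ids        # placeholder

  disk_size      = 100           # GiB, e.g. read from the qhub config file
  instance_types = ["m5.xlarge"] # placeholder

  scaling_config {
    desired_size = 1
    max_size     = 5
    min_size     = 1
  }
}
```

Compared with the launch-template route sketched above, this keeps the Terraform surface area small, at the cost of being specific to AWS managed node groups.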
Funny, I missed that; it looks pretty straightforward. Thanks!
Here is an example of implementing something like that: #604
As @aktech commented on #719, per the discussions in dask-gateway:409, we're waiting for the next release of dask-gateway to close this.
Actually, this issue is different: it was due to the disk space on the nodes not being large enough, which looks like it was fixed today in #740, so this should be OK to close.
Oh, thanks @brl0! @iameskild, have you encountered this issue again since the update from #740?
@viniciusdc we have redeployed a few times and, today, I ran several Prefect flows without any issue. Thank you for following up!
Describe the bug
A clear and concise description of what the problem is.
We have a `prefect-agent` pod running and have registered a few flows with it. As we have been working to get these flows running as expected (we are currently also experiencing dask-gateway issues), we have been running into issues with the cluster. As part of the Prefect flow testing, I will manually start a flow run from the Prefect cloud console, and this action results in many critical pods being evicted and an additional `general` node being created, splitting the qhub cluster pods between these two `general` nodes.

Here is the most accurate timeline I have documented so far:
1. Kick off a Prefect flow run from the Prefect cloud console.
2. A `prefect-job` pod spins up on the `general` node and, after a few minutes, the pod fails and falls into a `CrashLoopBackOff` state with the following error messages:
3. At this point, many of the other pods on the `general` node start being quickly evicted: the `user-scheduler`, `qhub-traefik-ingress`, `qhub-jupyterhub-ssh` and `proxy` pods. The evicted `user-scheduler` pods have this event message:
4. Then the `conda-store` and `hub` pods get evicted and have trouble coming back online. `conda-store` and `hub` pod event messages:

After documenting all of the above, I waited a few minutes to see if the cluster would gracefully correct itself, but ultimately had to manually kill all of the evicted pods and manually kill the `prefect-job` job. To get back down to one `general` node, I drained the new `general` node and force-killed any pods that didn't want to be evicted. This resulted in a stable cluster.

After doing a bit of digging around and reading online, our hypothesis is that the attached block storage got overwhelmed by the size of the `prefect-job` image and was forced to evict many of the other pods. There are currently three block-store volumes attached to the `general` node:

The next step for us is to reduce the size of the `prefect-job` image; however, we were also wondering if we could (or should) increase the size of the block-stores attached to the `general` node.

How can we help?
Help us help you.
- Help us get the `prefect-job` to spin up without causing trouble for the cluster

Your environment
Describe the environment in which you are experiencing the bug.