[Q&A] DiskPressure due to 80 GB /home/fluentd/core. files #4415
Replies: 4 comments 1 reply
-
We are also facing the same disk pressure issue in the EKS cluster. Attaching relevant logs from the fluentd containers and environment details.
core files from the container
EKS Version: 1.25
-
Is it possible to identify which files are occupying the disk space? For example, buffer files.
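One way to answer this is to walk the suspect directory and list the largest files. The following is only a minimal sketch, not something taken from Fluentd itself: the function name `largest_files` is made up for illustration, and the path you pass in (for example `/home/fluentd` from the thread title) is an assumption about where the files live in your container. It reports the apparent file size (`st_size`), which can differ from the blocks actually used on disk for sparse files.

```python
import heapq
import os


def largest_files(root, limit=20):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first.

    `largest_files` is a hypothetical helper written for this thread,
    not part of Fluentd or any Fluentd plugin.
    """
    sizes = []
    # onerror swallows unreadable directories so the walk keeps going
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a link is measured
                # as the link itself rather than as its target
                size = os.lstat(path).st_size
            except OSError:
                # the file may have been rotated or deleted mid-walk
                continue
            sizes.append((size, path))
    return heapq.nlargest(limit, sizes)
```

Calling `largest_files("/home/fluentd", limit=10)` inside the container and printing each pair should show whether core files, buffer files, or something else dominates.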
-
We just experienced the worst weekend in the history of IT (or maybe the second worst, because it's "only" the staging environment). Still, the team I'm on was ready to skip the Super Bowl and treat it as the highest-priority call, the kind usually reserved for production disasters. On the weekend I chose to upgrade our EKS v1.28.5 clusters to 1.29 and then 1.30, someone else on our team decided it was a good Friday to upgrade Fluentd from v1.14.5-711 to 1.16.5-80 or some shit.

Well, the 48 hours that ensued were the most brutal work experience of my life. It was worse than when I worked for Symantec and had to travel to Houston, Texas to fix a customer's SSIM issues. They sat me down in the middle of the room and locked me in until I fixed the problems.

Anyway, yes, there is something very wrong with v1.16.5-80. It was filling up the /var/lib/containerd/io.containerd.snapshotter.overlayfs/snapshots directory with many hundreds of GBs of snapshots for some reason. I went from our normal 137GB volume to a 1TB volume, which it filled up in about 10 hours. It wasn't until I happened to try downgrading Fluentd (the thing the other person on my team had installed) and relaunched the cluster for the third time, with a 2TB volume for my next tests (trying to literally "buy" ourselves some more time to troubleshoot), that it magically stopped happening on the relaunched nodes. I had gone through this process numerous times while fighting IP constraints, because the agency I work for doesn't seem to have enough network space to do what they are doing. It was a fucking nightmare.

Have y'all tested with EKS v1.29 and v1.30 yet? Please do the world a favor and test more on those versions. The issue takes a while to start happening.
-
I suspect this issue has already been fixed: see fluent/fluentd-docker-image#385. If so, it was caused by an old version of systemd-journal and fluent-plugin-systemd.
-
What is a problem?
Describe the configuration of Fluentd
Describe the logs of Fluentd
fluentd.log
Environment