reth killed by out-of-memory killer #8115
Comments
I just restarted reth with jemalloc heap profiling enabled. I'll update the bug if I can trigger this again with the profiling, but it may take quite a while to reproduce.
hey @gbrew do you have debug logs you can send over? They would be in
Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.
hmm, they should be enabled automatically. Is this running in a service of some sort, or as root? If it's running as a systemd system service or similar, then
Maybe this happens because requests are coming in faster than we can write responses, so the output buffer runs full.
I just checked, and I'm running with
Are you thinking of a specific buffer which I can see on the dashboard? None of the ones I saw looked to be increasing in size during the memory growth, FWIW.
I just caught this crash again, but with heap profiling enabled. If I'm reading correctly, it looks like it is using 32GB of memory for the Vec allocated in get_headers_response().
Here are the Grafana jemalloc stats for the latest crash. Interestingly, there are 5 episodes of linear memory growth in the 3 hours before crashing, but in all but the last of them it recovers before using up all the memory. It does look consistent with continuing to process p2p requests but not sending the responses for long periods of time, along the lines of what @mattsse was suggesting above. Seems a bit crazy this would happen for more than an hour straight, though. Maybe some explicit form of backpressure is needed to prioritize sending responses?
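To make the backpressure idea concrete, here is a minimal sketch, not reth's actual networking code: the response type, the bound of 64, and the use of a tokio bounded channel are all assumptions. The point is only that a bounded response queue makes request handling stall when the peer writer falls behind, instead of letting queued responses grow without limit.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

// Placeholder for an encoded headers response; the sizes below are made up.
struct HeadersResponse {
    headers: Vec<Vec<u8>>,
}

#[tokio::main]
async fn main() {
    // Bounded queue: at most 64 responses can be buffered at once.
    let (tx, mut rx) = mpsc::channel::<HeadersResponse>(64);

    // Request side: builds responses to incoming p2p header requests.
    let producer = tokio::spawn(async move {
        for _ in 0..1_000 {
            let resp = HeadersResponse {
                headers: vec![vec![0u8; 512]; 128],
            };
            // `send` waits when the queue is full, so at most ~64 responses
            // are held in memory instead of an ever-growing backlog.
            if tx.send(resp).await.is_err() {
                break;
            }
        }
    });

    // Write side: drains responses toward a (possibly slow) peer connection.
    while let Some(resp) = rx.recv().await {
        tokio::time::sleep(Duration::from_millis(1)).await; // simulate a slow writer
        drop(resp);
    }

    let _ = producer.await;
}
```

Under a scheme like this, a slow or stalled peer connection would throttle how fast its requests are served rather than inflate the heap.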
Related: #8131
Ran into the same issue. I didn't save logs because I figured it was a result of heavy RPC load (querying states for days). After checking my code and restarting, I can confirm it's not the RPC load (memory usage remains steady around 30GB when I start). If it happens again I'll include the metrics flag on restart and keep track of p2p header requests. Execution: Reth v0.2.0-beta.8
This issue is stale because it has been open for 21 days with no activity. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
I am seeing OOM kills happen fairly frequently too. I haven't installed the monitoring stack yet but I assume it's related to p2p as I don't utilize rpc in any way.
do you mind setting up monitoring and/or checking what docker's memory limits are? And could you file a new issue for this?
No limits are applied by docker (or podman in this instance). If you notice the
Quick update: I've continued seeing an OOM crash every month or so as of v1.0.4. I haven't done any work to verify that these crashes are the same issue laid out here, but I'd guess they are.
what commit are you running @gbrew?
@gbrew can we move this to a new issue? Just so that we have the full context in one place. @0xmichalis feel free to comment on the new issue as well.
Of course, I'll open a fresh issue if I see this happen again.
This issue is stale because it has been open for 21 days with no activity. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Describe the bug
Reth 0.2.0-beta.6 running on Ubuntu 22.04 was killed by the OOM killer.
It had been running at the chain tip for more than two weeks before the jemalloc stats started to increase linearly from ~3.5GB to ~40GB over less than an hour, at which point it was killed. The machine has 64GB RAM.
The machine is running reth, lighthouse, an arbitrum node which is using reth, and an RPC client program which is primarily calling eth_call, debug_traceCall and eth_subscribe (for new blocks) on reth. It's not a particularly heavy RPC load, but there may be some bursts of activity. It is primarily using IPC transport for RPC. I don't see any signs of anything bad happening right before the OOM in any of the client logs or the reth log.
Memory graphs are attached along with the relevant bit of reth.log. Happy to upload more info, but none of the other graphs on the dashboard showed any obvious issues.
reth.log
Steps to reproduce
Sync reth to the Ethereum chain tip
Subscribe to new blocks and run debug_traceCall to get logs and state changes from mempool txns, using IPC (a rough sketch of this workload follows the steps below)
Wait several weeks
Boom!
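For context, the RPC workload in the second step looks roughly like the sketch below. This is a simplified stand-in, not the actual client: the socket path is a placeholder, the JSON-RPC framing is written by hand rather than via a client library, and the debug_traceCall requests issued for each new head are only described in a comment.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    // Placeholder path: point this at the node's actual IPC endpoint.
    let mut stream = UnixStream::connect("/path/to/reth.ipc")?;

    // Subscribe to new block headers over raw JSON-RPC.
    let subscribe =
        r#"{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}"#;
    stream.write_all(subscribe.as_bytes())?;

    // Print notifications as they arrive; the real client reacts to each new
    // head by issuing debug_traceCall requests for pending transactions.
    let mut buf = [0u8; 4096];
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            break;
        }
        println!("{}", String::from_utf8_lossy(&buf[..n]));
    }
    Ok(())
}
```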
Node logs
No response
Platform(s)
Linux (x86)
What version/commit are you on?
reth Version: 0.2.0-beta.6
Commit SHA: ac29b4b
Build Timestamp: 2024-04-26T04:52:11.095680376Z
Build Features: jemalloc
Build Profile: maxperf
What database version are you on?
Current database version: 2
Local database version: 1
Which chain / network are you on?
mainnet
What type of node are you running?
Archive (default)
What prune config do you use, if any?
None
If you've built Reth from source, provide the full command you used
cargo build --profile maxperf --bin reth