reth killed by out-of-memory killer #8115
Comments
I just restarted reth with jemalloc heap profiling enabled. I'll update the bug if I can trigger this again with the profiling, but it may take quite a while to reproduce.
hey @gbrew do you have debug logs you can send over? They would be in
Sorry, the only logs I found there were not current, so I didn't include them. Will see if I need to change my config to get debug logs enabled again.
hmm, they should be enabled automatically. Is this running in a service of some sort, or as root? If it's running as a systemd system service or similar, then
Maybe this happens because requests are coming in faster than we can write responses, so the output buffer runs full.
I just checked, and I'm running with
Are you thinking of a specific buffer which I can see on the dashboard? None of the ones I saw looked to be increasing in size during the memory growth, FWIW.
I just caught this crash again, but with heap profiling enabled. If I'm reading correctly, it looks like it is using 32GB of memory for the Vec allocated in get_headers_response().
Here are the Grafana jemalloc stats for the latest crash. Interestingly, there are 5 episodes of linear memory growth in the 3 hours before crashing, but in all but the last of them it recovers before using up all the memory. It does look consistent with continuing to process p2p requests but not sending the responses for long periods of time, along the lines of what @mattsse was suggesting above. Seems a bit crazy this would happen for more than an hour straight, though. Maybe some explicit form of backpressure is needed to prioritize sending responses?
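To make the backpressure idea concrete, here is a minimal sketch, not reth's actual networking code: the response type, the bound of 64, and the use of a tokio bounded channel are all assumptions. The point is only that a bounded response queue makes request handling stall when the peer writer falls behind, instead of letting queued responses grow without limit.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

// Placeholder for an encoded headers response; the sizes below are made up.
struct HeadersResponse {
    headers: Vec<Vec<u8>>,
}

#[tokio::main]
async fn main() {
    // Bounded queue: at most 64 responses can be buffered at once.
    let (tx, mut rx) = mpsc::channel::<HeadersResponse>(64);

    // Request side: builds responses to incoming p2p header requests.
    let producer = tokio::spawn(async move {
        for _ in 0..1_000 {
            let resp = HeadersResponse {
                headers: vec![vec![0u8; 512]; 128],
            };
            // `send` waits when the queue is full, so at most ~64 responses
            // are held in memory instead of an ever-growing backlog.
            if tx.send(resp).await.is_err() {
                break;
            }
        }
    });

    // Write side: drains responses toward a (possibly slow) peer connection.
    while let Some(resp) = rx.recv().await {
        tokio::time::sleep(Duration::from_millis(1)).await; // simulate a slow writer
        drop(resp);
    }

    let _ = producer.await;
}
```

Under a scheme like this, a slow or stalled peer connection would throttle how fast its requests are served rather than inflate the heap.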
Related: #8131
Ran into the same issue. I didn't save logs because I figured it was a result of heavy RPC load (querying states for days). After checking my code and restarting, I can confirm it's not the RPC load (memory usage remains steady around 30GB when I start). If it happens again I'll include the metrics flag on restart and keep track of p2p header requests. Execution: Reth v0.2.0-beta.8
This issue is stale because it has been open for 21 days with no activity. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
I am seeing OOM kills happen fairly frequently too. I haven't installed the monitoring stack yet but I assume it's related to p2p as I don't utilize rpc in any way.
do you mind setting up monitoring and/or checking what docker's memory limits are? And could you file a new issue for this?
No limits are applied by docker (or podman in this instance). If you notice the
Quick update: I've continued seeing an OOM crash every month or so as of v1.0.4. I haven't done any work to verify that these crashes are the same issue laid out here, but I'd guess they are.
what commit are you running @gbrew?
@gbrew can we move this to a new issue? Just so that we have the full context in one place. @0xmichalis feel free to comment on the new issue as well.
Of course, I'll open a fresh issue if I see this happen again.
This issue is stale because it has been open for 21 days with no activity. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Describe the bug
Reth 0.2.0-beta.6 running on Ubuntu 22.04 was killed by the OOM killer.
It had been running at the chain tip for more than two weeks before the jemalloc stats started to increase linearly from ~3.5GB to ~40GB over less than an hour, at which point it was killed. The machine has 64GB RAM.
The machine is running reth, lighthouse, an arbitrum node which is using reth, and an RPC client program which is primarily calling eth_call, debug_traceCall and eth_subscribe (for new blocks) on reth. It's not a particularly heavy RPC load, but there may be some bursts of activity. It is primarily using IPC transport for RPC. I don't see any signs of anything bad happening right before the OOM in any of the client logs or the reth log.
Memory graphs are attached along with the relevant bit of reth.log. Happy to upload more info, but none of the other graphs on the dashboard showed any obvious issues.
reth.log
Steps to reproduce
Sync reth to the Ethereum chain tip
Subscribe to new blocks and run debug_traceCall to get logs and state changes from mempool txns, using IPC (a rough sketch of this workload follows the steps below)
Wait several weeks
Boom!
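For context, the RPC workload in the second step looks roughly like the sketch below. This is a simplified stand-in, not the actual client: the socket path is a placeholder, the JSON-RPC framing is written by hand rather than via a client library, and the debug_traceCall requests issued for each new head are only described in a comment.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    // Placeholder path: point this at the node's actual IPC endpoint.
    let mut stream = UnixStream::connect("/path/to/reth.ipc")?;

    // Subscribe to new block headers over raw JSON-RPC.
    let subscribe =
        r#"{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}"#;
    stream.write_all(subscribe.as_bytes())?;

    // Print notifications as they arrive; the real client reacts to each new
    // head by issuing debug_traceCall requests for pending transactions.
    let mut buf = [0u8; 4096];
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            break;
        }
        println!("{}", String::from_utf8_lossy(&buf[..n]));
    }
    Ok(())
}
```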
Node logs
No response
Platform(s)
Linux (x86)
What version/commit are you on?
reth Version: 0.2.0-beta.6
Commit SHA: ac29b4b
Build Timestamp: 2024-04-26T04:52:11.095680376Z
Build Features: jemalloc
Build Profile: maxperf
What database version are you on?
Current database version: 2
Local database version: 1
Which chain / network are you on?
mainnet
What type of node are you running?
Archive (default)
What prune config do you use, if any?
None
If you've built Reth from source, provide the full command you used
cargo build --profile maxperf --bin reth