Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #20582. Non-executable transactions should not be evicted. #21119

Conversation

villanuevawill
Copy link
Contributor

Closes #20582.

Description

Queued transactions should not be evicted until after evictionInterval has passed. Successful promotion of transactions via promoteExecutables, pending replacement via add(...), or insertion into the queue via enqueueTx should update the last heartbeat. The heartbeat is no longer removed when the pending list is emptied or after a minute has passed in the future queue.

Tests are expanded to ensure queued transactions are not removed before the evictionInterval duration has passed. Additionally, tests are expanded to ensure a reset or emptied pending list does not clear out the heart beat.

Logging is expanded to detect evictions as a method to prevent/monitor issues in the future.

Monitoring Before & After

We see a 5x reduction in queue evictions and the queue is able to reach capacity. Previously, I was seeing 5 tx/s evicted from the queue, that has been reduced to 1 tx/s evicted. Additionally, the queue is able to grow to its limit, from ~500 to 1024 in the case of my config.

Before:
image
image

After:
image
image

Questions

There could be a case for removing heartbeat updates on pending replaced and in promote executables, however, it seems optimal to keep the queue fresh when new activity is seen and to manage ordering appropriately on truncation.

@villanuevawill
Copy link
Contributor Author

villanuevawill commented May 27, 2020

@karalabe issue #20582 was originally assigned to you it looks like, so tagging you on here. I've been working in txpool code due to our work on account abstraction, so wanted to fix issues along the way. lmk if anything you'd like updated here.

@holiman
Copy link
Contributor

holiman commented May 27, 2020

@karalabe - I'm thinking it may be a good idea to spin this up on one of the bootnodes. Want me to do that?

@karalabe
Copy link
Member

@holiman Pls do. The PR seems tiny enough, will try to review tomorrow.

@holiman
Copy link
Contributor

holiman commented May 27, 2020

I'll rebase it first

@holiman holiman force-pushed the eth/txpool-fix-queued-evictions branch from 953b9fe to 72ece8e Compare May 27, 2020 17:11
@holiman
Copy link
Contributor

holiman commented May 27, 2020

aws-eu-central, last six hours before redeploying with this PR
Screenshot_2020-05-27 Single Geth - Grafana

and the last 15 minutes
Screenshot_2020-05-27 Single Geth - Grafana(1)

@holiman
Copy link
Contributor

holiman commented May 27, 2020

After switching to this PR, 15 minutes:
Screenshot_2020-05-27 Single Geth - Grafana(2)

Here are a couple of hours, capturing both before and after switch

Screenshot_2020-05-27 Single Geth - Grafana(3)

Zooming in a bit

Screenshot_2020-05-27 Single Geth - Grafana(4)

@holiman
Copy link
Contributor

holiman commented May 27, 2020

Not sure why this graph looks like this though: (same time-segment as above)
Screenshot_2020-05-27 Single Geth - Grafana(5)

@holiman
Copy link
Contributor

holiman commented Jun 2, 2020

This PR lgtm, has been running stable for a while now

Copy link
Contributor

@holiman holiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rjl493456442
Copy link
Member

rjl493456442 commented Jun 4, 2020

the beats map is a little confusing for me. It seems to represent the Last heartbeat from each known account(from the comment) but actually it's not in the current code or PR.

Now we use beats for two things: (1) pick the least active users to evict queue transaction (2) drop old enough queue transactions. For the latter one, I think the best solution may record the timestamp of each transaction(but tbh not sure whether it's worthwhile to add this complexity).

And for the former, it's even worse. We only record the beat when we successfully promote a transaction. But if a user caches lots of queue transaction but never successfully promote any one. Then the queue transactions belong to this user will never be dropped.

@holiman
Copy link
Contributor

holiman commented Jun 4, 2020

Another comparison chart, this PR (green) against another bootnode:
Screenshot_2020-06-04 Multi Geth - Grafana

@villanuevawill
Copy link
Contributor Author

villanuevawill commented Jun 17, 2020

Now we use beats for two things: (1) pick the least active users to evict queue transaction (2) drop old enough queue transactions. For the latter one, I think the best solution may record the timestamp of each transaction(but tbh not sure whether it's worthwhile to add this complexity).

@rjl493456442 beats are also used when the queue reaches capacity to determine which transactions to throw out by ordering the set - https://github.com/ethereum/go-ethereum/blob/master/core/tx_pool.go#L1341

For the latter one, I think the best solution may record the timestamp of each transaction(but tbh not sure whether it's worthwhile to add this complexity).

Yes, this would be ideal.

We only record the beat when we successfully promote a transaction. But if a user caches lots of queue transaction but never successfully promote any one. Then the queue transactions belong to this user will never be dropped.

You are right, if an account continues to have enqueued transactions yet it never makes it to pending, then these queued transactions will remain while they continue to actively include new queued transactions. However, in the current scenario, these just get dropped on every eviction tick.

A more optimal solution would be to:

  1. Have a timestamp for each separate tx as you mentioned
  2. Update the current heartbeat on each promotion or pending replaced (not on enqueue)
  3. Only evict queue items past their timestamp

If you would like, I'm happy to put up a PR on this optimal solution (I've been in the txpool code for a while at this point and wouldn't mind).

@villanuevawill
Copy link
Contributor Author

villanuevawill commented Jun 17, 2020

For the latter one, I think the best solution may record the timestamp of each transaction(but tbh not sure whether it's worthwhile to add this complexity).

I'll update the PR soon with the optimal solution and an updated explanation - it doesn't appear to add much complexity. Sorry for the late reply on this thread - I missed the notifications.

@villanuevawill villanuevawill force-pushed the eth/txpool-fix-queued-evictions branch 2 times, most recently from 4174a3a to 5cee963 Compare June 23, 2020 03:32
@villanuevawill
Copy link
Contributor Author

villanuevawill commented Jun 23, 2020

@holiman & @rjl493456442 I've made the updates described above and this is ready for review.

image
image
image

The pool is stable locally with the updates. I've updated to:

  1. Only update the heartbeat on promotion or pending replaced (not on enqueue)
  2. Clear the heartbeat when pending is removed
  3. Add a ts for each transaction hash
  4. Evict a transaction from the queue only if the heartbeat is past the lifetime and the transaction ts is past the lifetime

I've added thorough testing to check the edge cases. Also, I updated validateTxPoolInternals in the tests to verify internals are correct and that the length of ts entries is always equal to the # of transactions in the queue: https://github.com/ethereum/go-ethereum/pull/21119/files#diff-59daeee0a5084a262da2e62280567af0R111 - this adds additional security to the changes and tests this thoroughly.

@villanuevawill villanuevawill requested a review from holiman June 23, 2020 03:41
core/tx_pool.go Outdated
@@ -363,7 +366,10 @@ func (pool *TxPool) loop() {
// Any non-locals old enough should be removed
if time.Since(pool.beats[addr]) > pool.config.Lifetime {
for _, tx := range pool.queue[addr].Flatten() {
pool.removeTx(tx.Hash(), true)
if time.Since(pool.queued_ts[tx.Hash()]) > pool.config.Lifetime {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get a mismatch between what's in pool.queued_tx and other pools, the pool.queued_ts[tx.Hash()] will return 0, and flush out the tx. I suppose that is indeed the correct way to handle that error.
For testing though, it might be interesting to make it panic, and run it for a few days with a panic in there, just to see that we don't have any inconsistencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just updated this to panic, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was a bit unclear. We can't merge it to master with a panic in there, it was more of a note-to-self (or whoever puts it on a prod server) to put a panic check there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll drop the last commit and rebase it on master, ok?

core/tx_pool.go Outdated
@@ -363,7 +367,13 @@ func (pool *TxPool) loop() {
// Any non-locals old enough should be removed
if time.Since(pool.beats[addr]) > pool.config.Lifetime {
for _, tx := range pool.queue[addr].Flatten() {
pool.removeTx(tx.Hash(), true)
if pool.queued_ts[tx.Hash()].IsZero() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holiman here we include the check and panic.

… on each tick if there are no promote transactions or if a pending/reset empties the pending list. Tests and logging expanded to handle these cases in the future.
@holiman holiman force-pushed the eth/txpool-fix-queued-evictions branch from 2423562 to 45334d9 Compare July 6, 2020 13:18
@villanuevawill
Copy link
Contributor Author

@holiman updates applied 💯

@holiman
Copy link
Contributor

holiman commented Jul 6, 2020 via email

@villanuevawill
Copy link
Contributor Author

villanuevawill commented Jul 6, 2020

Great! Did you by any chance run it with panics enabled yourself?

Yes I did for a few days and it didn't panic.

@holiman
Copy link
Contributor

holiman commented Jul 24, 2020

Closed in favour of #21300

@holiman holiman closed this Jul 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-local, non-executable transactions are 'randomly' removed from transaction pool queue
5 participants