Continuous backlog population #3999
It looks like the unit test 'request_aggregator.cannot_vote' failed 4 times.
@dsiganos I see, there seem to be two tests that are still failing; I'm looking into that. It appears to break only on GH runners, so it is a bit annoying to debug.
It fails when the system is under heavy load. Starting a parallel build from scratch seems to make it crash. I got this crash on my laptop:
I left a number of minor comments but it looks good to me overall.
…xpects that it will not receive a vote for send1 because it has not made such a request. However, if the election is still active or was recently active, it may receive a broadcast vote before it makes a request. Check that the election has ended on node1 and allow some time for in-flight vote broadcasts to finish before starting node2.
Backlog population is a process in which a node scans all accounts in the ledger, with or without any confirmed blocks, and forwards (activates) those accounts that do not have all their blocks confirmed to the election scheduler for prioritization and eventual queuing in the proper bucket. This must be done periodically, because the space in each bucket is limited (currently ~2000 entries) and the number of accounts needing confirmation can be much higher than that, especially during bootstrap or a network spam attack.
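The scan described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual implementation: the `Account` type, `needs_confirmation`, and `populate_backlog` are hypothetical names, and the scheduler is simplified to a single bucket capped at the approximate 2000-entry limit mentioned above.

```python
from dataclasses import dataclass

BUCKET_CAPACITY = 2000  # approximate per-bucket limit quoted in the description


@dataclass
class Account:
    name: str
    block_count: int      # total blocks in the account chain
    confirmed_count: int  # how many of those blocks are confirmed


def needs_confirmation(account: Account) -> bool:
    """An account is a backlog candidate if any of its blocks is unconfirmed."""
    return account.confirmed_count < account.block_count


def populate_backlog(ledger, bucket, capacity=BUCKET_CAPACITY):
    """Scan every account and activate those with unconfirmed blocks.

    Returns the number of candidates that did not fit into the bucket,
    which is why the scan has to be repeated later.
    """
    left_over = 0
    for account in ledger:
        if not needs_confirmation(account):
            continue
        if len(bucket) < capacity:
            bucket.append(account.name)
        else:
            left_over += 1
    return left_over


# Hypothetical ledger: 6000 accounts, half of them with unconfirmed blocks.
ledger = [
    Account(f"acc{i}", block_count=3, confirmed_count=3 if i % 2 else 1)
    for i in range(6000)
]
bucket = []
left_over = populate_backlog(ledger, bucket)
print(len(bucket), left_over)
```

In this toy run only part of the backlog fits into the bucket, so the remaining accounts have to wait for a later scan.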
The problem with the current implementation is that this process runs every 5 minutes and scans the whole ledger at once, leading to situations where we run out of accounts to prioritize before the next run has started. This is especially visible during bootstrapping; a graph showing such a situation is included below. We can clearly see the bumps in AEC occupancy where the prioritization queue is filled, followed by periods of idleness when the priority queue is emptied:
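The fill-then-idle pattern can be reproduced with a small back-of-the-envelope simulation. The model below is a rough sketch under stated assumptions: `idle_seconds` is a hypothetical helper, the queue is a single counter, and the drain rate of 100 accounts per second is an invented figure chosen only to make the effect visible, not a measured node value.

```python
def idle_seconds(backlog, capacity, drain_per_sec, scan_interval, duration):
    """Count seconds in which the priority queue is empty while backlog remains.

    backlog        accounts still waiting to be activated
    capacity       maximum queue size (about 2000 per the description)
    drain_per_sec  accounts the node finishes per second (assumed value)
    scan_interval  seconds between scans that refill the queue
    duration       simulated time in seconds
    """
    queue = 0
    idle = 0
    for t in range(duration):
        # A scan refills the queue up to its capacity.
        if t % scan_interval == 0:
            moved = min(capacity - queue, backlog)
            queue += moved
            backlog -= moved
        # Nothing to work on although accounts are still unconfirmed.
        if queue == 0 and backlog > 0:
            idle += 1
        queue -= min(queue, drain_per_sec)
    return idle


# Refill every 5 minutes versus refill every second, over 10 minutes.
periodic = idle_seconds(100_000, 2000, 100, scan_interval=300, duration=600)
continuous = idle_seconds(100_000, 2000, 100, scan_interval=1, duration=600)
print(periodic, continuous)
```

Under these assumed numbers the queue drains in about 20 seconds and then sits empty until the next 5 minute scan, while frequent refills keep it occupied.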
This PR fixes that by modifying the way the ledger scan is done. Instead of running at a 5 minute interval, the scan runs all the time (unless disabled by the `frontiers_confirmation = disabled` node config setting), but we throttle the rate at which the scan is done to limit consumption of node resources. The rate and frequency are controlled by two new node-config.toml settings: `backlog_scan_batch_size` and `backlog_scan_frequency`. By default it scans 10000 accounts per second divided into 10 batches, so 1000 accounts per batch. This is rather conservative and should later be adjusted with feedback from beta node operators (before this PR we did it in batches of 64k).
The result of this PR is an AEC that stays full almost all the time (except the initial phase of the bootstrap):
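The default throttling arithmetic (10000 accounts per second split into 10 batches) can be illustrated with a short sketch. The function names and parameters below (`batch_plan`, `throttled_scan`, `accounts_per_second`, `batches_per_second`) are hypothetical and are not the node's API; how they map onto the two config settings is an assumption. The sketch also performs a single pass, whereas the scan in the PR keeps running continuously.

```python
import time
from itertools import islice


def batch_plan(accounts_per_second=10_000, batches_per_second=10):
    """Derive the per-batch size and the pause between batches."""
    batch_size = accounts_per_second // batches_per_second
    interval = 1.0 / batches_per_second
    return batch_size, interval


def throttled_scan(accounts, activate, accounts_per_second=10_000,
                   batches_per_second=10, sleep=time.sleep):
    """Walk over the accounts once, in rate-limited batches.

    `activate` is called for every account; `sleep` is injectable so the
    sketch can be exercised without actually waiting.
    Returns the number of batches processed.
    """
    batch_size, interval = batch_plan(accounts_per_second, batches_per_second)
    iterator = iter(accounts)
    batches = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for account in batch:
            activate(account)
        batches += 1
        sleep(interval)  # throttle: at most batches_per_second batches per second
    return batches


activated = []
batches = throttled_scan(range(2500), activated.append, sleep=lambda s: None)
print(batch_plan(), batches, len(activated))
```

Splitting the per-second budget into several smaller batches spreads the ledger reads over time instead of issuing one large burst, which is the resource trade-off the description refers to.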