This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Exclude partition reward accounts in account hash calculation #34809

Closed
wants to merge 7 commits into master from 2024_jan_17

Conversation

@HaoranYi (Contributor) commented Jan 17, 2024

Problem

We can run the validator against mainnet with the new partitioned reward code by enabling the --partitioned-epoch-rewards-force-enable-single-slot CLI argument.

Recently, my test node running with the above CLI flag failed on an account hash mismatch. It turns out that the mismatch comes from the new partition reward account introduced at the epoch boundary in #34624.

In order for this CLI flag to continue to work after #34624, we need to ignore the epoch partition reward accounts in the account hash calculation.

Summary of Changes

Ignore partition reward accounts in the accounts hash calculation when we force-enable partitioned rewards.
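
For illustration, here is a minimal sketch of the exclusion idea: filter an ignore set of pubkeys out of the scan before the accounts hash is computed. The function and parameter names below are hypothetical, not the actual accounts-db API.

use std::collections::HashSet;

use solana_sdk::pubkey::Pubkey;

/// Hypothetical filter step: drop any pubkey that belongs to the ignore set
/// (the epoch_rewards sysvar plus the per-epoch partition reward PDAs) so the
/// accounts hash is computed only over the remaining accounts.
fn filter_ignored_accounts(pubkeys: Vec<Pubkey>, ignored: &HashSet<Pubkey>) -> Vec<Pubkey> {
    pubkeys
        .into_iter()
        .filter(|pubkey| !ignored.contains(pubkey))
        .collect()
}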

Fixes #

@HaoranYi changed the title from "Exclude partition reward account in account hash calculation" to "Exclude partition reward accounts in account hash calculation" on Jan 17, 2024
@HaoranYi (Contributor Author)

I have restarted my test node with this PR against mainnet.

Currently, we are at 10% of epoch 561.

Block height: 223259936
Slot: 242398091
Epoch: 561
Transaction Count: 262218377112
Epoch Slot Range: [242352000..242784000)
Epoch Completed Percent: 10.669%
Epoch Completed Slots: 46091/432000 (385909 remaining)
Epoch Completed Time: 5h 25m 31s/2days 2h 1m 9s (1day 20h 35m 38s remaining)

We will see if this fix works in 2 days.

codecov bot commented Jan 17, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (f9bfb60) 81.7% compared to head (242f79f) 81.7%.
Report is 36 commits behind head on master.

❗ Current head 242f79f differs from the pull request's most recent head 54918e0. Consider uploading reports for commit 54918e0 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #34809    +/-   ##
========================================
  Coverage    81.7%    81.7%            
========================================
  Files         826      825     -1     
  Lines      223371   223302    -69     
========================================
+ Hits       182578   182629    +51     
+ Misses      40793    40673   -120     

@HaoranYi force-pushed the 2024_jan_17 branch 2 times, most recently from 8f14ef4 to 242f79f on January 18, 2024 15:44
@HaoranYi (Contributor Author)

Test node with this PR runs successfully across the epoch boundary!

Block height: 223643554
Slot: 242798190
Epoch: 562
Transaction Count: 262687685231
Epoch Slot Range: [242784000..243216000)
Epoch Completed Percent: 3.285%
Epoch Completed Slots: 14190/432000 (417810 remaining)
Epoch Completed Time: 1h 38m 43s/2days 1h 34m 38s (1day 23h 55m 55s remaining)

@brooksprumo (Contributor)

Does this bug impact the partitioned epoch rewards feature gate itself, or only the --partitioned-epoch-rewards-force-enable-single-slot CLI flag?

@HaoranYi (Contributor Author) commented Jan 22, 2024

Does this bug impact the partitioned epoch rewards feature gate itself, or only the --partitioned-epoch-rewards-force-enable-single-slot CLI flag?

No, it doesn't. It only affects the case where --partitioned-epoch-rewards-force-enable-single-slot is used before the feature is activated.

@HaoranYi (Contributor Author) commented Jan 22, 2024

Verify that this PR works

1. Running jw14 without the PR results in a crash at the epoch boundary.

Slot: 243418619
Epoch: 563
Transaction Count: 263412977100
Epoch Slot Range: [243216000..243648000)
Epoch Completed Percent: 46.903%
Epoch Completed Slots: 202619/432000 (229381 remaining)
Epoch Completed Time: 23h 20m 28s/2days 1h 27m 54s (1day 2h 7m 26s remaining)
thread 'solReplayStage' panicked at core/src/replay_stage.rs:1414:25:
We have tried to repair duplicate slot: 243216000 more than 10 times and are unable to freeze a block with bankhash EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF, instead we have a block with bankhash Some(Fg8G6EjERZmTBSn8rKCdTRqmHVx6JE4BouyRHeQZYnMS). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: hashbrown::map::HashMap<K,V,S,A>::retain
   3: solana_core::replay_stage::ReplayStage::dump_then_repair_correct_slots
   4: solana_core::replay_stage::ReplayStage::new::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-01-21T15:48:42.256475777Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at core/src/replay_stage.rs:1414:25:
    We have tried to repair duplicate slot: 243216000 more than 10 times and are unable to freeze a block with bankhash EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF, instead we have a block with bankhash Some(Fg8G6EjERZmTBSn8rKCdTRqmHVx6JE4BouyRHeQZYnMS). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting" location="core/src/replay_stage.rs:1414:25" version="1.18.0 (src:00000000; feat:4046558620, client:SolanaLabs)"

This is the log from jw14 running without this PR for the most recent epoch boundary. At the epoch boundary slot 243216000, the node crashes because of the bankhash mismatch.

2. Running ledger-tool with the PR on jw14 succeeds.

 ./cargo run --release --bin solana-ledger-tool -- --ledger ~/ledger verify --partitioned-epoch-rewards-force-enable-single-slot --halt-at-slot 243216000 2>&1 | tee verify.log
[2024-01-22T15:53:39.518831156Z INFO  solana_runtime::bank] bank frozen: 243216000 hash: EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF accounts_delta: 7pNJmQoSte2DL18ZMgKfsz3mZYURgeFkTo7Et9GSoFWG signature_count: 0 last_blockhash: ES2RQg1vA1Vgpz1qmDgWB2KJyGC4jivHr2PRoStSh5Ra capitalization: 567729394157396052, stats: BankHashStats { num_updated_accounts: 829464, num_removed_accounts: 1, num_lamports_stored: 391291053032340887, total_data_len: 173784497, num_executable_accounts: 2 }

EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF bank hash matched.

Furthermore, the log shows that the epoch_reward partition data account is created as expected!

[2024-01-22T15:53:30.882687996Z INFO  solana_runtime::bank] create epoch_reward partition data account F6amLHBgM4h9AHz9BcyRsUymCbtMvd739Sd1WZKGdscP V0(
        PartitionData {
            num_partitions: 1,
            parent_blockhash: 4ePEv3ksbuvAehYXFCHPsj3z5jRJ8MLC1bp8V8ma7aiS,
            hasher_kind: Sip13,
        },
    )
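
For reference, the shape of the data shown in that log line can be sketched roughly as follows. The type names and field types are inferred from the log output above, not copied from the runtime.

use solana_sdk::hash::Hash;

/// Hash kind used to assign stake accounts to reward partitions (only the
/// variant visible in the log is listed here).
enum HasherKind {
    Sip13,
}

/// Payload of the per-epoch partition data account, as printed in the log.
struct PartitionData {
    num_partitions: usize,
    parent_blockhash: Hash,
    hasher_kind: HasherKind,
}

/// Versioned wrapper, matching the `V0(...)` seen in the log.
enum EpochRewardsPartitionDataVersion {
    V0(PartitionData),
}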

@brooksprumo (Contributor)

Is this an issue because, with the CLI flag, we create the PDA account and thus produce a different bank hash, since the rest of the cluster does not have this account?

@HaoranYi (Contributor Author)

Is this an issue because, with the CLI flag, we create the PDA account and thus produce a different bank hash, since the rest of the cluster does not have this account?

Yeah. Exactly.

@brooksprumo (Contributor) commented Jan 22, 2024

Yeah. Exactly.

Great! Then yes, this idea looks safe to me. We already do the same thing for the sysvar accounts, so also ignoring another account makes sense.

@jeffwashington (Contributor)

Will capitalization still match with the bank? Does this sysvar have any lamports?

@jeffwashington (Contributor)

Part 2: When is the sysvar deleted? We had to work on the earlier CLI PR to make sure we deleted the sysvar that holds the rewards themselves (as lamports). Since the CLI arg does it all in one slot, the rewards sysvar gets created, drained, then deleted.

@HaoranYi (Contributor Author) commented Jan 22, 2024

Will capitalization still match with the bank? Does this sysvar have any lamports?

Good question. No, capitalization will mismatch. We need to fix that too.

The new accounts are not sysvar accounts. One account is created for each epoch, and they won't be deleted; they stay forever.
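
As a rough sketch of how such a per-epoch PDA can be derived (the seed string and owner program below are illustrative assumptions, not the exact values used by the runtime):

use solana_sdk::pubkey::Pubkey;

/// Derive a deterministic per-epoch address. A distinct address (and account)
/// exists for every epoch, which is why the set of excluded accounts grows
/// with the epoch number.
fn epoch_rewards_partition_data_address(epoch: u64, owner_program: &Pubkey) -> Pubkey {
    let (pda, _bump) = Pubkey::find_program_address(
        &[b"EpochRewardsPartitionData", &epoch.to_le_bytes()],
        owner_program,
    );
    pda
}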

@HaoranYi (Contributor Author) commented Jan 22, 2024

Also, turning the CLI flag on/off affects the epoch_account_hash too.

I have been running a node that toggles this CLI flag between epochs, and it results in epoch_account_hash mismatches. We have to fix the epoch account hash too, i.e. exclude the PDA accounts from it as well.

#34876 has more details.

@HaoranYi (Contributor Author) commented Jan 22, 2024

I restarted jw14 with the new fixes for lamport capitalization and the epoch account hash. Let's see if this works.

Block height: 224307951
Slot: 243486306
Epoch: 563
Transaction Count: 263486182012
Epoch Slot Range: [243216000..243648000)
Epoch Completed Percent: 62.571%
Epoch Completed Slots: 270306/432000 (161694 remaining)
Epoch Completed Time: 1day 7h 6m 6s/2days 1h 39m 5s (18h 32m 59s remaining)

We will know at epoch 564.75; we are now at epoch 563.62, about 1.13 epochs away.

Comment on lines 139 to 153
if self.addresses.is_empty() {
    self.addresses.insert(sysvar::epoch_rewards::id());

    for e in 0..=epoch {
        let pda = get_epoch_rewards_partition_data_address(e);
        self.addresses.insert(pda);
    }
    self.max_epoch = epoch;
} else if epoch > self.max_epoch {
    for e in self.max_epoch + 1..=epoch {
        let pda = get_epoch_rewards_partition_data_address(e);
        self.addresses.insert(pda);
    }
    self.max_epoch = epoch;
}

Contributor

There's a lot of duplication in the if and else branches; can that be factored out?

Contributor Author

Yeah, refactored!
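
For context, one possible shape of the factored-out version, based on the snippet quoted above (a sketch only; the actual change pushed to the PR may differ):

// Compute the first epoch whose PDA still needs to be inserted, then share a
// single insertion loop for both the "first call" and "new epoch" cases.
let start_epoch = if self.addresses.is_empty() {
    self.addresses.insert(sysvar::epoch_rewards::id());
    0
} else {
    self.max_epoch + 1
};
// An empty range (start_epoch > epoch) skips the loop when the current epoch
// has already been covered.
for e in start_epoch..=epoch {
    self.addresses.insert(get_epoch_rewards_partition_data_address(e));
}
self.max_epoch = self.max_epoch.max(epoch);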

Comment on lines +3623 to +3633
// ignore reward PDA account lamport when we are testing partitioned rewards before the actual feature activation.
if !self.is_partitioned_rewards_feature_enabled()
    && (self
        .partitioned_epoch_rewards_config()
        .test_enable_partitioned_rewards
        && self.get_reward_calculation_num_blocks() == 0
        && self.partitioned_rewards_stake_account_stores_per_block() == u64::MAX)
{
    self.capitalization
        .fetch_sub(new_account.lamports(), Relaxed);
}

Contributor

Instead of selectively ignoring the PDA and subtracting the lamports, can we just not store the PDA account when testing?

Contributor Author

That's an option, and probably the simplest option. However, this won't allow us to test the new reward RPC methods.

The new reward RPC method requires that those accounts exist in accounts-db. The RPC method reads these accounts and uses them to extract stake rewards from the corresponding reward blocks.

Contributor

If --partitioned-epoch-rewards-force-enable-single-slot is supplied, is it even possible to exercise the new RPC method? Or is the full feature gate required?

Maybe this is something that impacts tests but not running a node? I'm not sure, just trying to think through all the combinations.

Contributor Author

Good question. If the new RPC method is feature gated, then we won't be able to run the method without feature activation.

But I think the RPC method will have no access to the feature gate. Instead, it relies on whether the PDA account exists or not to determine which code path to run. If the PDA account exists, then it will follow the code path that extracts rewards from partitioned reward blocks. Otherwise, it will follow the old code path to extract rewards from the first block in the epoch.

@CriesofCarrots, could you confirm this?

Contributor

To clarify, it's not a new RPC method, just an adaptation of the existing getInflationReward. And yes, it checks whether the feature gate has been enabled to decide which block(s) to pull to find rewards. https://github.com/solana-labs/solana/pull/34773/files#diff-b55d9bcb5d2e7eae09fd352edcc765f9c2a4c05b97e9847582071d35850e8671R607
It will not know anything about the --partitioned-epoch-rewards-force-enable-single-slot flag. Since that flag puts all rewards in the first slot anyway, the RPC method should still work. But it's not a particularly useful test case for the RPC changes.

@HaoranYi (Contributor Author) commented Jan 25, 2024

OK.
It looks like we are unlikely to use --partitioned-epoch-rewards-force-enable-single-slot as a way to test the getInflationReward RPC, and this PR seems too "hacky" anyway.

I am closing this PR and will create a new PR that gets --partitioned-epoch-rewards-force-enable-single-slot working by simply not storing the new PDA account when the flag is active.

@CriesofCarrots (Contributor)

I'm holding off on review until the churn is done and CI is green. Feel free to give me a nudge when it's at that point.

@HaoranYi (Contributor Author)

My test node passed the epoch account hash check for epoch 564. The fix works!
