This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Exclude partition reward accounts in account hash calculation #34809

Closed
wants to merge 7 commits into master from 2024_jan_17

Conversation

@HaoranYi (Contributor) commented Jan 17, 2024

Problem

We can run the validator against mainnet with the new partitioned reward code by enabling the --partitioned-epoch-rewards-force-enable-single-slot CLI argument.

Recently, my test node running with the above CLI flag failed on an account hash mismatch. It turns out that the mismatch comes from the new partition reward account introduced at the epoch boundary in #34624.

In order for this CLI flag to continue to work after #34624, we need to ignore the epoch partition reward accounts in the account hash calculation.

Summary of Changes

Ignore partition reward accounts in the accounts hash calculation when we force-enable partitioned rewards.
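
For illustration, here is a minimal sketch of the exclusion idea: filter an ignore set of pubkeys out of the scan before the accounts hash is computed. The function and parameter names below are hypothetical, not the actual accounts-db API.

use std::collections::HashSet;

use solana_sdk::pubkey::Pubkey;

/// Hypothetical filter step: drop any pubkey that belongs to the ignore set
/// (the epoch_rewards sysvar plus the per-epoch partition reward PDAs) so the
/// accounts hash is computed only over the remaining accounts.
fn filter_ignored_accounts(pubkeys: Vec<Pubkey>, ignored: &HashSet<Pubkey>) -> Vec<Pubkey> {
    pubkeys
        .into_iter()
        .filter(|pubkey| !ignored.contains(pubkey))
        .collect()
}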

Fixes #

@HaoranYi changed the title from "Exclude partition reward account in account hash calculation" to "Exclude partition reward accounts in account hash calculation" on Jan 17, 2024
@HaoranYi (Contributor Author)

I have restarted my test node with this PR against mainnet.

Currently, we are at 10% of epoch 561.

Block height: 223259936
Slot: 242398091
Epoch: 561
Transaction Count: 262218377112
Epoch Slot Range: [242352000..242784000)
Epoch Completed Percent: 10.669%
Epoch Completed Slots: 46091/432000 (385909 remaining)
Epoch Completed Time: 5h 25m 31s/2days 2h 1m 9s (1day 20h 35m 38s remaining)

We will see if this fix works in 2 days.

codecov bot commented Jan 17, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (f9bfb60) 81.7% compared to head (242f79f) 81.7%.
Report is 36 commits behind head on master.

❗ Current head 242f79f differs from the pull request's most recent head 54918e0. Consider uploading reports for commit 54918e0 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #34809    +/-   ##
========================================
  Coverage    81.7%    81.7%            
========================================
  Files         826      825     -1     
  Lines      223371   223302    -69     
========================================
+ Hits       182578   182629    +51     
+ Misses      40793    40673   -120     

@HaoranYi force-pushed the 2024_jan_17 branch 2 times, most recently from 8f14ef4 to 242f79f on January 18, 2024 15:44
@HaoranYi (Contributor Author)

Test node with this PR runs successfully across the epoch boundary!

Block height: 223643554
Slot: 242798190
Epoch: 562
Transaction Count: 262687685231
Epoch Slot Range: [242784000..243216000)
Epoch Completed Percent: 3.285%
Epoch Completed Slots: 14190/432000 (417810 remaining)
Epoch Completed Time: 1h 38m 43s/2days 1h 34m 38s (1day 23h 55m 55s remaining)

@brooksprumo (Contributor)

Does this bug impact the partitioned epoch rewards feature gate itself, or only the --partitioned-epoch-rewards-force-enable-single-slot CLI flag?

@HaoranYi (Contributor Author) commented Jan 22, 2024

Does this bug impact the partitioned epoch rewards feature gate itself, or only the --partitioned-epoch-rewards-force-enable-single-slot CLI flag?

No, it doesn't. It only affects the case where --partitioned-epoch-rewards-force-enable-single-slot is used before the feature is activated.

@HaoranYi (Contributor Author) commented Jan 22, 2024

Verify that this PR works

1. Running jw14 without the PR results in a crash at the epoch boundary.

Slot: 243418619
Epoch: 563
Transaction Count: 263412977100
Epoch Slot Range: [243216000..243648000)
Epoch Completed Percent: 46.903%
Epoch Completed Slots: 202619/432000 (229381 remaining)
Epoch Completed Time: 23h 20m 28s/2days 1h 27m 54s (1day 2h 7m 26s remaining)
thread 'solReplayStage' panicked at core/src/replay_stage.rs:1414:25:
We have tried to repair duplicate slot: 243216000 more than 10 times and are unable to freeze a block with bankhash EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF, instead we have a block with bankhash Some(Fg8G6EjERZmTBSn8rKCdTRqmHVx6JE4BouyRHeQZYnMS). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: hashbrown::map::HashMap<K,V,S,A>::retain
   3: solana_core::replay_stage::ReplayStage::dump_then_repair_correct_slots
   4: solana_core::replay_stage::ReplayStage::new::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-01-21T15:48:42.256475777Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at core/src/replay_stage.rs:1414:25:
    We have tried to repair duplicate slot: 243216000 more than 10 times and are unable to freeze a block with bankhash EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF, instead we have a block with bankhash Some(Fg8G6EjERZmTBSn8rKCdTRqmHVx6JE4BouyRHeQZYnMS). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting" location="core/src/replay_stage.rs:1414:25" version="1.18.0 (src:00000000; feat:4046558620, client:SolanaLabs)"

This is the log from jw14 running without this PR for the most recent epoch boundary. At the epoch boundary slot 243216000, the node crashes because of the bankhash mismatch.

2. Running ledger-tool with the PR on jw14 succeeds.

 ./cargo run --release --bin solana-ledger-tool -- --ledger ~/ledger verify --partitioned-epoch-rewards-force-enable-single-slot --halt-at-slot 243216000 2>&1 | tee verify.log
[2024-01-22T15:53:39.518831156Z INFO  solana_runtime::bank] bank frozen: 243216000 hash: EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF accounts_delta: 7pNJmQoSte2DL18ZMgKfsz3mZYURgeFkTo7Et9GSoFWG signature_count: 0 last_blockhash: ES2RQg1vA1Vgpz1qmDgWB2KJyGC4jivHr2PRoStSh5Ra capitalization: 567729394157396052, stats: BankHashStats { num_updated_accounts: 829464, num_removed_accounts: 1, num_lamports_stored: 391291053032340887, total_data_len: 173784497, num_executable_accounts: 2 }

EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF bank hash matched.

Furthermore, the log shows that the epoch_reward partition data account is created as expected!

[2024-01-22T15:53:30.882687996Z INFO  solana_runtime::bank] create epoch_reward partition data account F6amLHBgM4h9AHz9BcyRsUymCbtMvd739Sd1WZKGdscP V0(
        PartitionData {
            num_partitions: 1,
            parent_blockhash: 4ePEv3ksbuvAehYXFCHPsj3z5jRJ8MLC1bp8V8ma7aiS,
            hasher_kind: Sip13,
        },
    )
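
For reference, the shape of the data shown in that log line can be sketched roughly as follows. The type names and field types are inferred from the log output above, not copied from the runtime.

use solana_sdk::hash::Hash;

/// Hash kind used to assign stake accounts to reward partitions (only the
/// variant visible in the log is listed here).
enum HasherKind {
    Sip13,
}

/// Payload of the per-epoch partition data account, as printed in the log.
struct PartitionData {
    num_partitions: usize,
    parent_blockhash: Hash,
    hasher_kind: HasherKind,
}

/// Versioned wrapper, matching the `V0(...)` seen in the log.
enum EpochRewardsPartitionDataVersion {
    V0(PartitionData),
}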

@brooksprumo (Contributor)

Is this an issue because, with the CLI flag, we create the PDA account and thus produce a different bank hash, since the rest of the cluster does not have this account?

@HaoranYi (Contributor Author)

Is this an issue because, with the CLI flag, we create the PDA account and thus produce a different bank hash, since the rest of the cluster does not have this account?

Yeah. Exactly.

@brooksprumo (Contributor) commented Jan 22, 2024

Yeah. Exactly.

Great! Then yes, this idea looks safe to me. We already do the same thing for the sysvar accounts, so also ignoring another account makes sense.

@jeffwashington (Contributor)

Will capitalization still match with the bank? Does this sysvar have any lamports?

@jeffwashington (Contributor)

Part 2: When is the sysvar deleted? We had to work on the earlier CLI PR to make sure we deleted the sysvar that holds the rewards themselves (as lamports). Since the CLI arg does it all in one slot, the rewards sysvar gets created, drained, then deleted.

@HaoranYi (Contributor Author) commented Jan 22, 2024

Will capitalization still match with the bank? Does this sysvar have any lamports?

Good question. No, capitalization will mismatch. We need to fix that too.

The new accounts are not sysvar accounts. One account is created for each epoch, and they won't be deleted; they stay forever.
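
As a rough sketch of how such a per-epoch PDA can be derived (the seed string and owner program below are illustrative assumptions, not the exact values used by the runtime):

use solana_sdk::pubkey::Pubkey;

/// Derive a deterministic per-epoch address. A distinct address (and account)
/// exists for every epoch, which is why the set of excluded accounts grows
/// with the epoch number.
fn epoch_rewards_partition_data_address(epoch: u64, owner_program: &Pubkey) -> Pubkey {
    let (pda, _bump) = Pubkey::find_program_address(
        &[b"EpochRewardsPartitionData", &epoch.to_le_bytes()],
        owner_program,
    );
    pda
}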

@HaoranYi (Contributor Author) commented Jan 22, 2024

Also, turning the CLI flag on/off affects the epoch_account_hash too.

I have been running a node that toggles this CLI flag between epochs, and it results in epoch_account_hash mismatches. We have to fix the epoch account hash too, i.e. exclude the PDA accounts from it as well.

#34876 has more details.

@HaoranYi (Contributor Author) commented Jan 22, 2024

I restarted jw14 with the new fixes for lamport capitalization and the epoch account hash. Let's see if this works.

Block height: 224307951
Slot: 243486306
Epoch: 563
Transaction Count: 263486182012
Epoch Slot Range: [243216000..243648000)
Epoch Completed Percent: 62.571%
Epoch Completed Slots: 270306/432000 (161694 remaining)
Epoch Completed Time: 1day 7h 6m 6s/2days 1h 39m 5s (18h 32m 59s remaining)

We will know at epoch 564.75; we are now at epoch 563.62, about 1.13 epochs away.

Comment on lines 139 to 153
if self.addresses.is_empty() {
    self.addresses.insert(sysvar::epoch_rewards::id());

    for e in 0..=epoch {
        let pda = get_epoch_rewards_partition_data_address(e);
        self.addresses.insert(pda);
    }
    self.max_epoch = epoch;
} else if epoch > self.max_epoch {
    for e in self.max_epoch + 1..=epoch {
        let pda = get_epoch_rewards_partition_data_address(e);
        self.addresses.insert(pda);
    }
    self.max_epoch = epoch;
}

Contributor

There's a lot of duplication in the if and else branches; can that be factored out?

Contributor Author

Yeah, refactored!
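
For context, one possible shape of the factored-out version, based on the snippet quoted above (a sketch only; the actual change pushed to the PR may differ):

// Compute the first epoch whose PDA still needs to be inserted, then share a
// single insertion loop for both the "first call" and "new epoch" cases.
let start_epoch = if self.addresses.is_empty() {
    self.addresses.insert(sysvar::epoch_rewards::id());
    0
} else {
    self.max_epoch + 1
};
// An empty range (start_epoch > epoch) skips the loop when the current epoch
// has already been covered.
for e in start_epoch..=epoch {
    self.addresses.insert(get_epoch_rewards_partition_data_address(e));
}
self.max_epoch = self.max_epoch.max(epoch);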

Comment on lines +3623 to +3633
// ignore reward PDA account lamport when we are testing partitioned rewards before the actual feature activation.
if !self.is_partitioned_rewards_feature_enabled()
    && (self
        .partitioned_epoch_rewards_config()
        .test_enable_partitioned_rewards
        && self.get_reward_calculation_num_blocks() == 0
        && self.partitioned_rewards_stake_account_stores_per_block() == u64::MAX)
{
    self.capitalization
        .fetch_sub(new_account.lamports(), Relaxed);
}

Contributor

Instead of selectively ignoring the PDA and subtracting the lamports, can we just not store the PDA account when testing?

Contributor Author

That's an option, and probably the simplest option. However, this won't allow us to test the new reward RPC methods.

The new reward RPC method requires that those accounts exist in accounts-db. The RPC method reads these accounts and uses them to extract stake rewards from the corresponding reward blocks.

Contributor

If --partitioned-epoch-rewards-force-enable-single-slot is supplied, is it even possible to exercise the new RPC method? Or is the full feature gate required?

Maybe this is something that impacts tests but not running a node? I'm not sure, just trying to think through all the combinations.

Contributor Author

Good question. If the new RPC method is feature gated, then we won't be able to run the method without feature activation.

But I think the RPC method will have no access to the feature gate. Instead, it relies on whether the PDA account exists or not to determine which code path to run. If the PDA account exists, then it will follow the code path that extracts rewards from partitioned reward blocks. Otherwise, it will follow the old code path to extract rewards from the first block in the epoch.

@CriesofCarrots, could you confirm this?

Contributor

To clarify, it's not a new RPC method, just an adaptation of the existing getInflationReward. And yes, it checks whether the feature gate has been enabled to decide which block(s) to pull to find rewards. https://github.com/solana-labs/solana/pull/34773/files#diff-b55d9bcb5d2e7eae09fd352edcc765f9c2a4c05b97e9847582071d35850e8671R607
It will not know anything about the --partitioned-epoch-rewards-force-enable-single-slot flag. Since that flag puts all rewards in the first slot anyway, the RPC method should still work. But it's not a particularly useful test case for the RPC changes.

@HaoranYi (Contributor Author) commented Jan 25, 2024

OK.
It looks like we are unlikely to use --partitioned-epoch-rewards-force-enable-single-slot as a way to test the getInflationReward RPC, and this PR seems too "hacky" anyway.

I am closing this PR and will create a new PR that gets --partitioned-epoch-rewards-force-enable-single-slot working by simply not storing the new PDA account when the flag is active.

@CriesofCarrots (Contributor)

I'm holding off on review until the churn is done and CI is green. Feel free to give me a nudge when it's at that point.

@HaoranYi (Contributor Author)

My test node passed the epoch account hash check for epoch 564. The fix works!
