
Fix scheduler deadlock by waiting at DropBankSrvc #947

Merged 6 commits into anza-xyz:master on Apr 25, 2024

Conversation


@ryoqun ryoqun commented Apr 21, 2024

Problem

The unified scheduler has a deadlock bug that can occur under rare conditions (in practice, it takes 2-3 days to hit it when running against mb).

Specifically, Arc<RwLock<BankForks>> can deadlock between solReplayStage and solScHandlerNN.

solReplayStage:
  1. solReplayStage write-locks BankForks for the entire duration of the BankForks::set_root() invocation:

agave/core/src/replay_stage.rs, lines 4139 to 4143 (at 0b9c637):

let removed_banks = bank_forks.write().unwrap().set_root(
    new_root,
    accounts_background_request_sender,
    highest_super_majority_root,
);

pub fn set_root(
    &mut self,
    root: Slot,
    accounts_background_request_sender: &AbsRequestSender,
    highest_super_majority_root: Option<Slot>,
) -> Vec<Arc<Bank>> {

  2. BankForks::set_root() ultimately calls BankForks::remove() to prune the bank forks by discarding minor forks:

self.prune_non_rooted(root, highest_super_majority_root);

pub fn remove(&mut self, slot: Slot) -> Option<Arc<Bank>> {

  3. Finally, BankWithScheduler::drop() is implicitly invoked via BankWithScheduler::clone_without_scheduler() inside BankForks::remove(). This can start a blocking scheduler termination (side note: the blocking is intentional, to avoid resource leaks) if not all scheduled tasks have finished executing yet, as can happen for abandoned blocks with too few ticks on a minor fork (note that usually the scheduler should have been waited on elsewhere; drop is the last resort; a simplified sketch of this blocking drop follows the list):

Some(bank.clone_without_scheduler())

  4. Now the replay stage thread is waiting on scheduler termination indefinitely while holding the write lock of BankForks.
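
For illustration, here is a minimal sketch (not the actual agave code; the type and method names are simplified stand-ins) of the blocking-drop behavior that step 3 relies on: the wrapper's Drop impl waits for any still-active scheduler before the bank is released.

// Simplified stand-ins; the real agave types carry more state and error handling.
use std::sync::Arc;

struct Bank;

struct Scheduler;
impl Scheduler {
    // Blocks until every task already handed to the scheduler has been executed.
    fn wait_for_termination(self) { /* ... */ }
}

struct BankWithScheduler {
    bank: Arc<Bank>,
    scheduler: Option<Scheduler>,
}

impl BankWithScheduler {
    // Hands out a plain Arc<Bank>; the scheduler stays attached to `self`.
    fn clone_without_scheduler(&self) -> Arc<Bank> {
        Arc::clone(&self.bank)
    }
}

impl Drop for BankWithScheduler {
    fn drop(&mut self) {
        // Last-resort wait: if the scheduler is still active (e.g. for an
        // abandoned minor fork), block here until it terminates. Inside
        // BankForks::remove() this runs while the bank_forks write lock is held.
        if let Some(scheduler) = self.scheduler.take() {
            scheduler.wait_for_termination();
        }
    }
}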
solScHandlerNN:

While the replay stage is already waiting for the completion of a given bank, it's possible that the unified scheduler is still actively running for that block. In that case, one of the handler threads will enter this call graph:

  1. TransactionBatchProcessor::load_and_execute_sanitized_transactions() calls into ProgramCache.

let programs_loaded_for_tx_batch = Rc::new(RefCell::new(self.replenish_program_cache(

let program_to_load = program_cache.extract(

  2. ProgramCache read-locks the BankForks to query block relationship information:

let locked_fork_graph = self.fork_graph.as_ref().unwrap().read().unwrap();

  3. Now, deadlock! To recap, the replay stage thread is waiting for scheduler termination, but scheduler termination can't happen because the replay stage's write lock prevents the handler threads from making further progress toward termination. (A self-contained sketch of this lock pattern follows this list.)
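
The lock pattern can be reproduced in isolation. The following self-contained sketch (generic Rust, not agave code; the thread roles mirror the description above) deterministically hangs: one thread holds the RwLock write lock while joining a worker that needs a read lock to finish.

use std::sync::{mpsc, Arc, RwLock};
use std::thread;

fn main() {
    let forks = Arc::new(RwLock::new(())); // stands in for Arc<RwLock<BankForks>>
    let (write_locked_tx, write_locked_rx) = mpsc::channel();

    let worker_forks = Arc::clone(&forks);
    // "solScHandlerNN": once the replay stage holds the write lock, it needs a
    // read lock (as ProgramCache::extract() does) to keep handling tasks.
    let handler = thread::spawn(move || {
        write_locked_rx.recv().unwrap();
        let _fork_info = worker_forks.read().unwrap(); // blocks forever
    });

    // "solReplayStage": write-locks the forks for the whole set_root() call ...
    let _bank_forks_guard = forks.write().unwrap();
    write_locked_tx.send(()).unwrap();
    // ... and, still holding the write lock, waits for scheduler termination
    // (modeled here by joining the handler thread). The handler can never get
    // the read lock, so neither thread makes progress: deadlock.
    handler.join().unwrap();
}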
gdb backtrace
Thread 369 (Thread 0x7ec88decf640 (LWP 562654) "solReplayStage"):
...
#17 solana_runtime::installed_scheduler_pool::BankWithSchedulerInner::drop_scheduler (self=0x7f3509306550) at runtime/src/installed_scheduler_pool.rs:394
...
#35 solana_runtime::bank_forks::BankForks::set_root (...) at runtime/src/bank_forks.rs:461
#36 0x000055b29e946de8 in solana_core::replay_stage::ReplayStage::handle_new_root (...) at core/src/replay_stage.rs:4139

Thread 663 (Thread 0x7ec66838e640 (LWP 563912) "solScHandler09"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000055b29e5ce3a1 in std::sys::unix::futex::futex_wait (futex=0x7f351042d610, expected=2147483647, timeout=...) at src/sys/unix/futex.rs:62
#2  std::sys::unix::locks::futex_rwlock::RwLock::read_contended (self=0x7f351042d610) at src/sys/unix/locks/futex_rwlock.rs:137
#3  0x000055b29f904f3a in std::sys::unix::locks::futex_rwlock::RwLock::read (self=0x7f351042d610) at /home/sol/.rustup/toolchains/nightly-2024-01-05-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_rwlock.rs:86
#4  std::sync::rwlock::RwLock<solana_runtime::bank_forks::BankForks>::read<solana_runtime::bank_forks::BankForks> (self=<optimized out>) at /home/sol/.rustup/toolchains/nightly-2024-01-05-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/rwlock.rs:210
#5  solana_program_runtime::loaded_programs::ProgramCache<solana_runtime::bank_forks::BankForks>::extract<solana_runtime::bank_forks::BankForks> (self=0x7f35103f7920, search_for=0x7ec66838ae28, is_first_round=true, loaded_programs_for_tx_batch=<optimized out>) at program-runtime/src/loaded_programs.rs:935
#6  solana_svm::transaction_processor::TransactionBatchProcessor<solana_runtime::bank_forks::BankForks>::replenish_program_cache<solana_runtime::bank_forks::BankForks, solana_runtime::bank::Bank> (self=0x7f34f1e67ca8, callback=0x7f34f1e67c10, program_accounts_map=0x7ec66838abe0, limit_to_load_programs=<optimized out>) at svm/src/transaction_processor.rs:544
#7  solana_svm::transaction_processor::TransactionBatchProcessor<solana_runtime::bank_forks::BankForks>::load_and_execute_sanitized_transactions<solana_runtime::bank_forks::BankForks, solana_runtime::bank::Bank, std::collections::hash::set::Iter<solana_program::pubkey::Pubkey>> (self=0x7f34f1e67ca8, callbacks=0x7f34f1e67c10, sanitized_txs=..., check_results=..., error_counters=0x7ec66838b0d0, recording_config=..., timings=0x7f3503e83220, account_overrides=..., builtin_programs=..., log_messages_bytes_limit=..., limit_to_load_programs=<optimized out>) at svm/src/transaction_processor.rs:226

Remarks

Originally, I thought it was better to avoid spreading BankWithScheduler far and wide across the codebase (solana-labs#33704), thinking that transaction execution shouldn't need to use bank forks. However, that's no longer the case with the recent development of ProgramCache, which is vital for transaction execution yet fork-aware (for its own good reasons).

Note that the blockstore processor doesn't have this deadlock. It can happen only with the unified scheduler, because the unified scheduler is asynchronous by nature (for performance reasons).

Lastly, there are two other places where the replay stage blocks on scheduler termination. Neither of them holds locks like this.

Summary of Changes

Fix the deadlock by delegating the wait for scheduler termination to the DropBankService, by passing around BankWithScheduler instead of Arc<Bank>.
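
A hedged sketch of the shape of this change (simplified types, not the literal agave diff): BankForks::set_root() hands back the pruned BankWithSchedulers, and the replay-stage side releases the write lock before forwarding them to the drop-bank service, where any blocking wait on scheduler termination happens.

use std::sync::{mpsc, Arc, RwLock};
use std::thread;

struct Bank;
struct BankWithScheduler(Arc<Bank>); // dropping this may block on its scheduler

struct BankForks {
    banks: Vec<BankWithScheduler>,
}

impl BankForks {
    // Sketch of set_root(): prune minor forks and *return* the pruned
    // BankWithSchedulers instead of dropping them under the write lock.
    fn set_root(&mut self) -> Vec<BankWithScheduler> {
        self.banks.drain(..).collect()
    }
}

fn main() {
    let bank_forks = Arc::new(RwLock::new(BankForks {
        banks: vec![BankWithScheduler(Arc::new(Bank))],
    }));

    // Stand-in for DropBankService: drops whatever it receives on its own thread.
    let (drop_bank_sender, drop_bank_receiver) = mpsc::channel::<Vec<BankWithScheduler>>();
    let drop_bank_service = thread::spawn(move || {
        for banks in drop_bank_receiver {
            drop(banks); // any blocking wait on scheduler termination happens here
        }
    });

    // Replay stage side: the write-lock guard is a temporary, so the lock is
    // already released by the time the removed banks are sent away.
    let removed_banks = bank_forks.write().unwrap().set_root();
    drop_bank_sender.send(removed_banks).unwrap();

    drop(drop_bank_sender); // close the channel so the service thread exits
    drop_bank_service.join().unwrap();
}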

context: extracted from #593

@codecov-commenter

Codecov Report

Attention: Patch coverage is 98.91304% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (6aacbf3) to head (44a5ea5).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #947     +/-   ##
=========================================
- Coverage    81.9%    81.8%   -0.1%     
=========================================
  Files         853      853             
  Lines      231812   231896     +84     
=========================================
+ Hits       189867   189885     +18     
- Misses      41945    42011     +66     

@ryoqun ryoqun requested a review from apfitzge April 22, 2024 06:47
@ryoqun ryoqun marked this pull request as ready for review April 22, 2024 06:49
@ryoqun ryoqun force-pushed the unified-scheduler-deadlock branch from ba5c80f to 07959f2 on April 22, 2024 15:00
};

#[test]
fn test_scheduler_waited_by_drop_bank_service() {
@ryoqun ryoqun (Collaborator Author)

Running this test without the fix reliably causes a deadlock.

A reviewer commented:

Sorry to nit-pick, but this test really does not seem in the right spot to me. It is not really testing the behavior of the DropBankService; it's testing the response of the scheduler, right?

Correct me if I'm wrong, but the fix for this was just taking ownership of the banks we want to drop for a bit longer, and then dropping them outside the write-lock of bank_forks, right?
So it seems this test either belongs in scheduling (pool?) code, or maybe bank_forks?

@ryoqun ryoqun (Collaborator Author) Apr 24, 2024

> Correct me if I'm wrong, but the fix for this was just taking ownership of the banks we want to drop for a bit longer, and then dropping them outside the write-lock of bank_forks, right?

This is all correct.

> So it seems this test either belongs in scheduling (pool?) code, or maybe bank_forks?

Semantically, yes. But I actually wanted to use DropBankService in the test, so solana-unified-scheduler and solana-runtime aren't the right crates (DropBankService is defined in solana-core).

> It is not really testing the behavior of the DropBankService

Strictly speaking, it is testing the behavior of DropBankService, regarding dropping BankWithScheduler. However, as you said, it's not that important who is dropping it, as long as it happens outside the write-lock of bank_forks.

That said, I think I've moved the test into a more natural place: 2c5d444 (after all, this is kind of an integration test).

@@ -268,7 +268,7 @@ impl BankForks {
        if entry.get().is_empty() {
            entry.remove_entry();
        }
        Some(bank.clone_without_scheduler())
@ryoqun ryoqun (Collaborator Author) Apr 22, 2024

Fun fact: this is the single actual code change to fix the deadlock, excluding all the type, comment, logging, and test code changes.

A reviewer replied:

Got it. So instead of dropping the scheduler part of the bank, we are just taking ownership of the bank with its scheduler.
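
For context, a hedged illustration of the two shapes being discussed (simplified names; the actual replacement line is not shown in this excerpt): before, BankForks::remove() strips the scheduler and therefore drops the BankWithScheduler under the write lock; after, it plausibly hands the BankWithScheduler back to the caller.

use std::sync::Arc;

struct Bank;
struct BankWithScheduler(Arc<Bank>); // dropping this may block on its scheduler

impl BankWithScheduler {
    fn clone_without_scheduler(&self) -> Arc<Bank> {
        Arc::clone(&self.0)
    }
}

// Before: the scheduler is stripped, so `bank` (the BankWithScheduler) is
// dropped right here, inside BankForks::remove(), under the bank_forks write lock.
fn remove_before(bank: BankWithScheduler) -> Option<Arc<Bank>> {
    Some(bank.clone_without_scheduler())
}

// After (hypothetical shape, inferred from the discussion above): the
// BankWithScheduler itself is handed back, deferring the potentially blocking
// drop until after the write lock is released.
fn remove_after(bank: BankWithScheduler) -> Option<BankWithScheduler> {
    Some(bank)
}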



@ryoqun ryoqun requested review from a team as code owners April 24, 2024 07:16
@ryoqun ryoqun requested a review from apfitzge April 24, 2024 07:31
@apfitzge apfitzge left a comment

lgtm

@ryoqun ryoqun merged commit 8e331e1 into anza-xyz:master Apr 25, 2024
49 checks passed