Fix scheduler deadlock by waiting at DropBankSrvc #947
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@           Coverage Diff            @@
##           master     #947    +/-  ##
=========================================
- Coverage    81.9%    81.8%   -0.1%
=========================================
  Files         853      853
  Lines      231812   231896     +84
=========================================
+ Hits       189867   189885     +18
- Misses      41945    42011     +66
Force-pushed from ba5c80f to 07959f2.
core/src/drop_bank_service.rs (Outdated)

};

#[test]
fn test_scheduler_waited_by_drop_bank_service() {
Running this test without the fix reliably causes a deadlock.
Sorry to nit-pick, but this test really does not seem to be in the right spot to me.
It is not really testing the behavior of the DropBankService; it's testing the response of the scheduler, right?
Correct me if I'm wrong, but the fix for this was just taking ownership of the banks we want to drop for a bit longer, and then dropping them outside the write-lock of bank_forks, right?
So it seems this test either belongs in the scheduling (pool?) code, or maybe bank_forks?
> Correct me if I'm wrong, but the fix for this was just taking ownership of the banks we want to drop for a bit longer, and then dropping them outside the write-lock of bank_forks, right?

This is all correct.

> So it seems this test either belongs in the scheduling (pool?) code, or maybe bank_forks?

Semantically, yes. But I actually wanted to use DropBankService in the test, so solana-unified-scheduler and solana-runtime aren't the right crates (i.e., DropBankService is defined in solana-core).

> It is not really testing the behavior of the DropBankService

Strictly speaking, it is testing the behavior of DropBankService, with regard to dropping BankWithScheduler. However, as you said, it doesn't really matter who drops it, as long as the drop happens outside the write-lock of bank_forks.

That said, I've moved the test to a more natural place: 2c5d444 (after all, this is a kind of integration test).
@@ -268,7 +268,7 @@ impl BankForks {
        if entry.get().is_empty() {
            entry.remove_entry();
        }
        Some(bank.clone_without_scheduler())
Fun fact: this is the single actual code change needed to fix the deadlock, excluding all the type, comment, logging, and test code changes.
Got it. So instead of dropping the scheduler part of the bank, we are just taking ownership of the bank with its scheduler.
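To illustrate the shape of that change, here is a minimal, self-contained sketch, assuming stand-in types throughout (`MaybeBlockingOnDrop` and both function names are hypothetical; the real code deals with `BankForks` and `BankWithScheduler`). The essence is returning ownership of a value whose `Drop` may block, so the drop runs only after the write guard has been released:

```rust
use std::sync::RwLock;

// Stand-in for BankWithScheduler: dropping it may block on scheduler
// termination, so it must never be dropped under the bank_forks write-lock.
struct MaybeBlockingOnDrop;

impl Drop for MaybeBlockingOnDrop {
    fn drop(&mut self) {
        // ... potentially blocking wait for scheduler termination ...
    }
}

// Pre-fix shape: the removed value is dropped while the write-lock is still
// held, so a blocking Drop stalls every other reader and writer of `forks`.
fn remove_dropping_inside(forks: &RwLock<Vec<MaybeBlockingOnDrop>>) {
    let mut guard = forks.write().unwrap();
    let removed = guard.pop();
    drop(removed); // Drop runs under the lock -- the deadlock hazard
}

// Post-fix shape: ownership is returned to the caller, so Drop runs (and may
// block) only after the write guard has been released.
fn remove_returning_ownership(
    forks: &RwLock<Vec<MaybeBlockingOnDrop>>,
) -> Option<MaybeBlockingOnDrop> {
    let mut guard = forks.write().unwrap();
    guard.pop()
} // write-lock released here

fn main() {
    let forks = RwLock::new(vec![MaybeBlockingOnDrop, MaybeBlockingOnDrop]);
    remove_dropping_inside(&forks); // the hazardous shape
    let removed = remove_returning_ownership(&forks);
    drop(removed); // any blocking wait now happens outside the lock
}
```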
lgtm
Problem
The unified scheduler has a deadlock bug, which can be triggered under some rare conditions (in practice, it takes 2-3 days to hit it when running against mainnet-beta).
Specifically, `Arc<RwLock<BankForks>>` can be deadlocked between `solReplayStage` and `solScHandlerNN`.

`solReplayStage`:

- `solReplayStage` write-locks `BankForks` during the entire duration of the `BankForks::set_root()` invocation (agave/core/src/replay_stage.rs, lines 4139 to 4143 in 0b9c637; agave/runtime/src/bank_forks.rs, lines 453 to 458 in 0b9c637).
- `BankForks::set_root()` ultimately calls `BankForks::remove()` to prune the tree by discarding minor forks (agave/runtime/src/bank_forks.rs, line 420 in 0b9c637; agave/runtime/src/bank_forks.rs, line 246 in 0b9c637).
- `BankWithScheduler::drop()` is implicitly called via `BankWithScheduler::clone_without_scheduler()` inside `BankForks::remove()`, which can start the blocking scheduler termination (side note: the blocking is intentional, to avoid resource leaks) if not all scheduled tasks have finished executing yet, as can happen for abandoned blocks with too few ticks on a minor fork (note that usually the scheduler should have been waited on elsewhere; drop is the last resort) (agave/runtime/src/bank_forks.rs, line 263 in 0b9c637). Thus, the blocking wait for scheduler termination can happen while still holding the write-lock on `BankForks`.

`solScHandlerNN`:

While the replay stage is already waiting for the completion of a given bank, it's possible that the unified scheduler is still actively running for that block. In that case, one of the handler threads will enter this call graph (a self-contained sketch of the resulting deadlock follows below):

- `TransactionBatchProcessor::load_and_execute_sanitized_transactions()` calls into `ProgramCache` (agave/svm/src/transaction_processor.rs, line 226 in 0b9c637; agave/svm/src/transaction_processor.rs, line 544 in 0b9c637).
- `ProgramCache` read-locks the `BankForks` to look up fork-relation information (agave/program-runtime/src/loaded_programs.rs, line 935 in 0b9c637).
(gdb backtrace omitted)
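To make the lock cycle concrete, here is a minimal, self-contained sketch of the deadlock, assuming stand-ins throughout: the `RwLock<()>` plays the role of `Arc<RwLock<BankForks>>`, the channel plays the role of the scheduler's termination signal, and a timeout substitutes for the real unbounded wait so the demo terminates. None of this is the actual agave code.

```rust
use std::sync::{mpsc, Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for Arc<RwLock<BankForks>>.
    let bank_forks = Arc::new(RwLock::new(()));
    // Stand-in for the scheduler's termination signal.
    let (task_done_tx, task_done_rx) = mpsc::channel::<()>();

    // "solReplayStage": write-locks bank_forks for the whole set_root().
    let write_guard = bank_forks.write().unwrap();

    // "solScHandlerNN": still executing a task for the block; like
    // ProgramCache, it must read-lock bank_forks before it can finish.
    let handler = {
        let bank_forks = Arc::clone(&bank_forks);
        thread::spawn(move || {
            let _read_guard = bank_forks.read().unwrap(); // blocks on the writer
            task_done_tx.send(()).ok(); // termination signal, sent too late
        })
    };

    // Before the fix: BankWithScheduler::drop() inside BankForks::remove()
    // waits for the scheduler *while the write-lock is still held*.
    match task_done_rx.recv_timeout(Duration::from_secs(1)) {
        Ok(()) => println!("no deadlock"),
        Err(_) => println!("deadlock: the handler needs the read-lock we hold"),
    }

    drop(write_guard); // the fix, in essence: release the lock before waiting
    handler.join().unwrap();
}
```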
Remarks
Originally, I thought it would be better to avoid spreading `BankWithScheduler` far and wide across the codebase (solana-labs#33704), thinking that transaction execution shouldn't need to use bank forks. However, that's no longer the case with the recent development of `ProgramCache`, which is vital for transaction execution, yet fork-aware (for its own good reasons).

Note that the blockstore processor doesn't have this deadlock; it can happen only with the unified scheduler, because the unified scheduler is asynchronous by nature (for performance reasons).

Lastly, there are two other places where the replay stage blocks on scheduler termination. Neither of those places holds locks like this.
Summary of Changes
Fix the deadlock by delegating the wait for scheduler termination to the `DropBankService`, by passing around `BankWithScheduler` instead of `Arc<Bank>`. A minimal sketch of this hand-off pattern follows below.

context: extracted from #593
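For illustration only, here is a minimal, self-contained sketch of that hand-off, with stand-in names throughout (`PrunedBank`, the channel variables, and the thread name are all hypothetical; this is not the actual agave implementation): ownership of the pruned banks, schedulers included, is shipped over a channel so the potentially blocking drop runs on a dedicated service thread, outside any lock.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for BankWithScheduler: Drop may block on scheduler termination.
struct PrunedBank(u64 /* slot */);

impl Drop for PrunedBank {
    fn drop(&mut self) {
        // In the real code, this is where the scheduler would be waited on.
        println!("slot {} dropped on {:?}", self.0, thread::current().name());
    }
}

fn main() {
    let (pruned_banks_tx, pruned_banks_rx) = mpsc::channel::<Vec<PrunedBank>>();

    // Drop-bank-service thread: the only place the blocking drop happens.
    let drop_bank_service = thread::Builder::new()
        .name("dropBankService".to_string()) // hypothetical thread name
        .spawn(move || {
            for banks in pruned_banks_rx {
                drop(banks); // any blocking wait occurs here, lock-free
            }
        })
        .unwrap();

    // Replay-stage side: prune under the bank_forks write-lock elsewhere,
    // release the lock, then ship ownership over the channel.
    pruned_banks_tx.send(vec![PrunedBank(42)]).unwrap();

    drop(pruned_banks_tx); // close the channel so the service exits
    drop_bank_service.join().unwrap();
}
```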