-
Notifications
You must be signed in to change notification settings - Fork 4.7k
Really start broadcast stage caching by fixing swapped CAS... #18842
Conversation
@@ -344,7 +344,7 @@ impl StandardBroadcastRun { | |||
if now - last > BROADCAST_PEER_UPDATE_INTERVAL_MS | |||
&& self | |||
.last_peer_update | |||
.compare_and_swap(now, last, Ordering::Relaxed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://doc.rust-lang.org/std/sync/atomic/struct.AtomicUsize.html#method.compare_and_swap:
pub fn compare_and_swap(
&self,
current: usize,
new: usize,
order: Ordering
) -> usize
Use compare_exchange or compare_exchange_weak instead
Stores a value into the atomic integer if the current value is the same as the current value.
Notice we're passing now
as current
, which means cas never succeeds and .compare_and_swap()
is returning last
value always, then if
always evaluating to true
here (kicking the ClusterNodes calc.
@@ -344,7 +344,7 @@ impl StandardBroadcastRun { | |||
if now - last > BROADCAST_PEER_UPDATE_INTERVAL_MS | |||
&& self | |||
.last_peer_update | |||
.compare_and_swap(now, last, Ordering::Relaxed) | |||
.compare_and_swap(last, now, Ordering::Relaxed) | |||
== last | |||
{ | |||
*self.cluster_nodes.write().unwrap() = ClusterNodes::<BroadcastStage>::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to this metrics, this iteration takes 2-3 ms (excluding spikes) on testnet. so there should be no immediate need for async (separate thread/service):
code: https://github.com/solana-labs/solana/pull/18776/files#r675045904
Doh. The metrics make a lot more sense now. Great find. Should also be fixed by: #18808 This is a bit more conservative fix though, so better to go with this small fix. |
Codecov Report
@@ Coverage Diff @@
## master #18842 +/- ##
=======================================
Coverage 82.8% 82.8%
=======================================
Files 443 443
Lines 125416 125416
=======================================
+ Hits 103864 103871 +7
+ Misses 21552 21545 -7 |
(cherry picked from commit 611af87)
(cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>
…lana-labs#18853) (cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>
these are sampled metrics who updated this pr ahead of official v1.7.7 ann.: Overall, I'm seeing improved numbers across directly affected ones, still not sure how this contributes to the cluster-wide skip rate, though... ;)
|
(cherry picked from commit 611af87)
(cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>
for the record: there is clearly improved pipelining: mainnet-beta nodes with v1.6.21 is clearly starting to broadcast as soon as shreds are created as seen by other nodes' retransmission initiation ticks. mainnet-beta nodes with v1.6.20 is delaying block propagation needlessly, choking other validators and next leaders. |
Problem
#10373 had rather a fatal bug, effectively not enabling caching at all...
Summary of Changes
really enable caching by swapping argument order of
::compare_and_swap
.context: this finding originated at #18776. but this single line should do well, comparable to the hackish async solution, which needs polishing like #18776 (comment)