Really start broadcast stage caching by fixing swapped CAS... #18842

ryoqun · 2021-07-22T17:48:28Z

Problem

#10373 had rather a fatal bug, effectively not enabling caching at all...

Summary of Changes

really enable caching by swapping argument order of ::compare_and_swap.

context: this finding originated at #18776. but this single line should do well, comparable to the hackish async solution, which needs polishing like #18776 (comment)

ryoqun · 2021-07-22T17:51:56Z

core/src/broadcast_stage/standard_broadcast_run.rs

@@ -344,7 +344,7 @@ impl StandardBroadcastRun {
        if now - last > BROADCAST_PEER_UPDATE_INTERVAL_MS
            && self
                .last_peer_update
-                .compare_and_swap(now, last, Ordering::Relaxed)


https://doc.rust-lang.org/std/sync/atomic/struct.AtomicUsize.html#method.compare_and_swap:

pub fn compare_and_swap( &self, current: usize, new: usize, order: Ordering ) -> usize Use compare_exchange or compare_exchange_weak instead Stores a value into the atomic integer if the current value is the same as the current value.

Notice we're passing now as current, which means cas never succeeds and .compare_and_swap() is returning last value always, then if always evaluating to true here (kicking the ClusterNodes calc.

ryoqun · 2021-07-22T17:58:39Z

core/src/broadcast_stage/standard_broadcast_run.rs

@@ -344,7 +344,7 @@ impl StandardBroadcastRun {
        if now - last > BROADCAST_PEER_UPDATE_INTERVAL_MS
            && self
                .last_peer_update
-                .compare_and_swap(now, last, Ordering::Relaxed)
+                .compare_and_swap(last, now, Ordering::Relaxed)
                == last
        {
            *self.cluster_nodes.write().unwrap() = ClusterNodes::<BroadcastStage>::new(


according to this metrics, this iteration takes 2-3 ms (excluding spikes) on testnet. so there should be no immediate need for async (separate thread/service):

code: https://github.com/solana-labs/solana/pull/18776/files#r675045904

sakridge · 2021-07-22T18:02:04Z

Doh. The metrics make a lot more sense now. Great find. Should also be fixed by: #18808

This is a bit more conservative fix though, so better to go with this small fix.

codecov · 2021-07-22T21:14:36Z

Codecov Report

Merging #18842 (4a5f9a1) into master (7fc4cfe) will increase coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##           master   #18842   +/-   ##
=======================================
  Coverage    82.8%    82.8%           
=======================================
  Files         443      443           
  Lines      125416   125416           
=======================================
+ Hits       103864   103871    +7     
+ Misses      21552    21545    -7

(cherry picked from commit 611af87)

(cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>

…lana-labs#18853) (cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>

ryoqun · 2021-07-23T12:34:25Z

these are sampled metrics who updated this pr ahead of official v1.7.7 ann.:

Overall, I'm seeing improved numbers across directly affected ones, still not sure how this contributes to the cluster-wide skip rate, though... ;)

SELECT sum("end_to_end_elapsed") AS "mean_end_to_end_elapsed" FROM "tds"."autogen"."broadcast-transmit-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("host_id"='2ZqaTLm1TZfpYdR7d8XhRntPjmrF6q69YZVLb6j4GcWz' OR "host_id"='5EgBAoCdu6r5BcrntpAW2rm5j77nSxJeD7oMDS927hxq' OR "host_id"='5dB4Ygb8Sf3Sssdxxrpbb4NFX9bMrYnieiz11Vr5xJkJ' OR "host_id"='7TcmJn12spW6KQJp4fvvo45d1hpxS8EnLjKMxihtNZ1V' OR "host_id"='9CYnw2VNWfipQiDKEjgmZsh36xTmDBcSu93mCfSvMRpc' OR "host_id"='Gmw9GarCUcQNYnqePXNBREuLhcMUwXhQWZMAvxSUf6c2' OR "host_id"='Hy5Mano3fc6AZceKCCwooL1sb55KaucRTGAMxRxsZ6qL' OR "host_id"='J7v9ndmcoBuo9to2MnHegLnBkC9x3SAVbQBJo5MMJrN1') GROUP BY time(:interval:), "host_id" FILL(0)

SELECT sum("transmit_elapsed") AS "mean_transmit_elapsed" FROM "tds"."autogen"."broadcast-transmit-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("host_id"='2ZqaTLm1TZfpYdR7d8XhRntPjmrF6q69YZVLb6j4GcWz' OR "host_id"='5EgBAoCdu6r5BcrntpAW2rm5j77nSxJeD7oMDS927hxq' OR "host_id"='5dB4Ygb8Sf3Sssdxxrpbb4NFX9bMrYnieiz11Vr5xJkJ' OR "host_id"='7TcmJn12spW6KQJp4fvvo45d1hpxS8EnLjKMxihtNZ1V' OR "host_id"='9CYnw2VNWfipQiDKEjgmZsh36xTmDBcSu93mCfSvMRpc' OR "host_id"='Gmw9GarCUcQNYnqePXNBREuLhcMUwXhQWZMAvxSUf6c2' OR "host_id"='Hy5Mano3fc6AZceKCCwooL1sb55KaucRTGAMxRxsZ6qL' OR "host_id"='J7v9ndmcoBuo9to2MnHegLnBkC9x3SAVbQBJo5MMJrN1') GROUP BY time(:interval:), "host_id" FILL(0)

SELECT sum("get_peers_elapsed") AS "mean_get_peers_elapsed" FROM "tds"."autogen"."broadcast-transmit-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("host_id"='2ZqaTLm1TZfpYdR7d8XhRntPjmrF6q69YZVLb6j4GcWz' OR "host_id"='5EgBAoCdu6r5BcrntpAW2rm5j77nSxJeD7oMDS927hxq' OR "host_id"='5dB4Ygb8Sf3Sssdxxrpbb4NFX9bMrYnieiz11Vr5xJkJ' OR "host_id"='7TcmJn12spW6KQJp4fvvo45d1hpxS8EnLjKMxihtNZ1V' OR "host_id"='9CYnw2VNWfipQiDKEjgmZsh36xTmDBcSu93mCfSvMRpc' OR "host_id"='Gmw9GarCUcQNYnqePXNBREuLhcMUwXhQWZMAvxSUf6c2' OR "host_id"='Hy5Mano3fc6AZceKCCwooL1sb55KaucRTGAMxRxsZ6qL' OR "host_id"='J7v9ndmcoBuo9to2MnHegLnBkC9x3SAVbQBJo5MMJrN1') GROUP BY time(:interval:), "host_id" FILL(0)

(cherry picked from commit 611af87)

(cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>

ryoqun · 2021-08-19T18:31:11Z

for the record:

there is clearly improved pipelining:

mainnet-beta nodes with v1.6.21 is clearly starting to broadcast as soon as shreds are created as seen by other nodes' retransmission initiation ticks.

mainnet-beta nodes with v1.6.20 is delaying block propagation needlessly, choking other validators and next leaders.

ryoqun · 2021-09-02T12:02:55Z

it seems skip rate is all time low these days (due to slow upgrade across cluster; I couldn't do proper eval):

Really start caching by fixing swapped CAS...

4a5f9a1

ryoqun changed the title ~~Really start caching by fixing swapped CAS...~~ Really start broadcast stage caching by fixing swapped CAS... Jul 22, 2021

ryoqun commented Jul 22, 2021

View reviewed changes

ryoqun added the v1.7 label Jul 22, 2021

ryoqun mentioned this pull request Jul 22, 2021

Cache tvu peers for broadcast #10373

Merged

ryoqun requested a review from sakridge July 22, 2021 18:03

sakridge approved these changes Jul 22, 2021

View reviewed changes

ryoqun merged commit 611af87 into solana-labs:master Jul 23, 2021

mergify bot pushed a commit that referenced this pull request Jul 23, 2021

Really start caching by fixing swapped CAS... (#18842)

1605f30

(cherry picked from commit 611af87)

mergify bot mentioned this pull request Jul 23, 2021

Really start broadcast stage caching by fixing swapped CAS... (backport #18842) #18853

Merged

mergify bot added a commit that referenced this pull request Jul 23, 2021

Really start caching by fixing swapped CAS... (#18842) (#18853)

cc58d36

(cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>

ryoqun added a commit to ryoqun/solana that referenced this pull request Jul 23, 2021

Really start caching by fixing swapped CAS... (solana-labs#18842) (so…

4464d25

…lana-labs#18853) (cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>

ryoqun added the v1.6 label Jul 29, 2021

mergify bot pushed a commit that referenced this pull request Jul 29, 2021

Really start caching by fixing swapped CAS... (#18842)

87f0510

(cherry picked from commit 611af87)

mergify bot mentioned this pull request Jul 29, 2021

Really start broadcast stage caching by fixing swapped CAS... (backport #18842) #18969

Merged

mergify bot added a commit that referenced this pull request Jul 29, 2021

Really start caching by fixing swapped CAS... (#18842) (#18969)

40fc144

(cherry picked from commit 611af87) Co-authored-by: Ryo Onodera <[email protected]>

This was referenced Aug 23, 2021

backport 19361 v16 #19379

Closed

backport 19361 v17 #19380

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Really start broadcast stage caching by fixing swapped CAS... #18842

Really start broadcast stage caching by fixing swapped CAS... #18842

ryoqun commented Jul 22, 2021

ryoqun Jul 22, 2021

ryoqun Jul 22, 2021 •

edited

Loading

sakridge commented Jul 22, 2021

codecov bot commented Jul 22, 2021

ryoqun commented Jul 23, 2021

ryoqun commented Aug 19, 2021

ryoqun commented Sep 2, 2021

Really start broadcast stage caching by fixing swapped CAS... #18842

Really start broadcast stage caching by fixing swapped CAS... #18842

Conversation

ryoqun commented Jul 22, 2021

Problem

Summary of Changes

ryoqun Jul 22, 2021

Choose a reason for hiding this comment

ryoqun Jul 22, 2021 • edited Loading

Choose a reason for hiding this comment

sakridge commented Jul 22, 2021

codecov bot commented Jul 22, 2021

Codecov Report

ryoqun commented Jul 23, 2021

ryoqun commented Aug 19, 2021

ryoqun commented Sep 2, 2021

ryoqun Jul 22, 2021 •

edited

Loading