Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Really start broadcast stage caching by fixing swapped CAS... #18842

Merged
merged 1 commit into from
Jul 23, 2021

Conversation

ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Jul 22, 2021

Problem

#10373 had rather a fatal bug, effectively not enabling caching at all...

Summary of Changes

really enable caching by swapping argument order of ::compare_and_swap.

context: this finding originated at #18776. but this single line should do well, comparable to the hackish async solution, which needs polishing like #18776 (comment)

@ryoqun ryoqun changed the title Really start caching by fixing swapped CAS... Really start broadcast stage caching by fixing swapped CAS... Jul 22, 2021
@@ -344,7 +344,7 @@ impl StandardBroadcastRun {
if now - last > BROADCAST_PEER_UPDATE_INTERVAL_MS
&& self
.last_peer_update
.compare_and_swap(now, last, Ordering::Relaxed)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://doc.rust-lang.org/std/sync/atomic/struct.AtomicUsize.html#method.compare_and_swap:

pub fn compare_and_swap(
    &self,
    current: usize,
    new: usize,
    order: Ordering
) -> usize

Use compare_exchange or compare_exchange_weak instead

Stores a value into the atomic integer if the current value is the same as the current value.

Notice we're passing now as current, which means cas never succeeds and .compare_and_swap() is returning last value always, then if always evaluating to true here (kicking the ClusterNodes calc.

@@ -344,7 +344,7 @@ impl StandardBroadcastRun {
if now - last > BROADCAST_PEER_UPDATE_INTERVAL_MS
&& self
.last_peer_update
.compare_and_swap(now, last, Ordering::Relaxed)
.compare_and_swap(last, now, Ordering::Relaxed)
== last
{
*self.cluster_nodes.write().unwrap() = ClusterNodes::<BroadcastStage>::new(
Copy link
Contributor Author

@ryoqun ryoqun Jul 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to this metrics, this iteration takes 2-3 ms (excluding spikes) on testnet. so there should be no immediate need for async (separate thread/service):

image

code: https://github.com/solana-labs/solana/pull/18776/files#r675045904

@sakridge
Copy link
Contributor

Doh. The metrics make a lot more sense now. Great find. Should also be fixed by: #18808

This is a bit more conservative fix though, so better to go with this small fix.

@ryoqun ryoqun requested a review from sakridge July 22, 2021 18:03
@codecov
Copy link

codecov bot commented Jul 22, 2021

Codecov Report

Merging #18842 (4a5f9a1) into master (7fc4cfe) will increase coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##           master   #18842   +/-   ##
=======================================
  Coverage    82.8%    82.8%           
=======================================
  Files         443      443           
  Lines      125416   125416           
=======================================
+ Hits       103864   103871    +7     
+ Misses      21552    21545    -7     

@ryoqun ryoqun merged commit 611af87 into solana-labs:master Jul 23, 2021
mergify bot pushed a commit that referenced this pull request Jul 23, 2021
mergify bot added a commit that referenced this pull request Jul 23, 2021
(cherry picked from commit 611af87)

Co-authored-by: Ryo Onodera <[email protected]>
ryoqun added a commit to ryoqun/solana that referenced this pull request Jul 23, 2021
@ryoqun
Copy link
Contributor Author

ryoqun commented Jul 23, 2021

these are sampled metrics who updated this pr ahead of official v1.7.7 ann.:

Overall, I'm seeing improved numbers across directly affected ones, still not sure how this contributes to the cluster-wide skip rate, though... ;)

image

SELECT sum("end_to_end_elapsed") AS "mean_end_to_end_elapsed" FROM "tds"."autogen"."broadcast-transmit-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("host_id"='2ZqaTLm1TZfpYdR7d8XhRntPjmrF6q69YZVLb6j4GcWz' OR "host_id"='5EgBAoCdu6r5BcrntpAW2rm5j77nSxJeD7oMDS927hxq' OR "host_id"='5dB4Ygb8Sf3Sssdxxrpbb4NFX9bMrYnieiz11Vr5xJkJ' OR "host_id"='7TcmJn12spW6KQJp4fvvo45d1hpxS8EnLjKMxihtNZ1V' OR "host_id"='9CYnw2VNWfipQiDKEjgmZsh36xTmDBcSu93mCfSvMRpc' OR "host_id"='Gmw9GarCUcQNYnqePXNBREuLhcMUwXhQWZMAvxSUf6c2' OR "host_id"='Hy5Mano3fc6AZceKCCwooL1sb55KaucRTGAMxRxsZ6qL' OR "host_id"='J7v9ndmcoBuo9to2MnHegLnBkC9x3SAVbQBJo5MMJrN1') GROUP BY time(:interval:), "host_id" FILL(0)

image

SELECT sum("transmit_elapsed") AS "mean_transmit_elapsed" FROM "tds"."autogen"."broadcast-transmit-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("host_id"='2ZqaTLm1TZfpYdR7d8XhRntPjmrF6q69YZVLb6j4GcWz' OR "host_id"='5EgBAoCdu6r5BcrntpAW2rm5j77nSxJeD7oMDS927hxq' OR "host_id"='5dB4Ygb8Sf3Sssdxxrpbb4NFX9bMrYnieiz11Vr5xJkJ' OR "host_id"='7TcmJn12spW6KQJp4fvvo45d1hpxS8EnLjKMxihtNZ1V' OR "host_id"='9CYnw2VNWfipQiDKEjgmZsh36xTmDBcSu93mCfSvMRpc' OR "host_id"='Gmw9GarCUcQNYnqePXNBREuLhcMUwXhQWZMAvxSUf6c2' OR "host_id"='Hy5Mano3fc6AZceKCCwooL1sb55KaucRTGAMxRxsZ6qL' OR "host_id"='J7v9ndmcoBuo9to2MnHegLnBkC9x3SAVbQBJo5MMJrN1') GROUP BY time(:interval:), "host_id" FILL(0)

image

SELECT sum("get_peers_elapsed") AS "mean_get_peers_elapsed" FROM "tds"."autogen"."broadcast-transmit-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("host_id"='2ZqaTLm1TZfpYdR7d8XhRntPjmrF6q69YZVLb6j4GcWz' OR "host_id"='5EgBAoCdu6r5BcrntpAW2rm5j77nSxJeD7oMDS927hxq' OR "host_id"='5dB4Ygb8Sf3Sssdxxrpbb4NFX9bMrYnieiz11Vr5xJkJ' OR "host_id"='7TcmJn12spW6KQJp4fvvo45d1hpxS8EnLjKMxihtNZ1V' OR "host_id"='9CYnw2VNWfipQiDKEjgmZsh36xTmDBcSu93mCfSvMRpc' OR "host_id"='Gmw9GarCUcQNYnqePXNBREuLhcMUwXhQWZMAvxSUf6c2' OR "host_id"='Hy5Mano3fc6AZceKCCwooL1sb55KaucRTGAMxRxsZ6qL' OR "host_id"='J7v9ndmcoBuo9to2MnHegLnBkC9x3SAVbQBJo5MMJrN1') GROUP BY time(:interval:), "host_id" FILL(0)

@ryoqun ryoqun added the v1.6 label Jul 29, 2021
mergify bot pushed a commit that referenced this pull request Jul 29, 2021
mergify bot added a commit that referenced this pull request Jul 29, 2021
(cherry picked from commit 611af87)

Co-authored-by: Ryo Onodera <[email protected]>
@ryoqun
Copy link
Contributor Author

ryoqun commented Aug 19, 2021

for the record:

there is clearly improved pipelining:

mainnet-beta nodes with v1.6.21 is clearly starting to broadcast as soon as shreds are created as seen by other nodes' retransmission initiation ticks.

mainnet-beta nodes with v1.6.20 is delaying block propagation needlessly, choking other validators and next leaders.

image

This was referenced Aug 23, 2021
@ryoqun
Copy link
Contributor Author

ryoqun commented Sep 2, 2021

it seems skip rate is all time low these days (due to slow upgrade across cluster; I couldn't do proper eval):

image

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants