Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bound gossip channels #5104

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

cpubot
Copy link

@cpubot cpubot commented Feb 28, 2025

Problem

The unbounded channel uses what is effectively a linked list of "blocks" of 31 slots. As volume surges, it has to dynamically allocate additional blocks to accommodate the extra slots required. It does this on demand, one-by-one, which adds extra synchronization overhead on the sending and receiving side in addition to the time waiting for allocation / deallocation -- this slows down the entire pipeline of packet processing. Contrast this to a bounded channel, which uses a single contiguous allocation without any of the block synchronization overhead.

Summary of Changes

Gossip packet buffers now used bounded channels, with a fixed, preallocated, capacity.

It was observed that under extreme load, the channel caps out at around 11k packet batches. The channel capacity introduced here is 16,384 (2^14) packet batches, which is the 11k figure rounded up to the next power of 2. This ensures that load shedding is highly unlikely to occur on the sender and continues to be done on the receiver side.

Senders are also updated to use try_send and introduce error handling in the event that channels are full. It is highly unlikely that these errors are actually hit in practice given the large capacity of the channels and the draining that is occurring in the receivers. This adjustment is made here for correctness and completeness, and also dovetail with forthcoming updates to load shed at the send site.

To reiterate, there is no practical change to current load shedding behavior in this branch, save for edge cases that were never observed in load testing and are highly unlikely to be encountered.

Broken out of #5065

Copy link

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+r network team approval

@gregcusack
Copy link

pr looks good but i am running some tests to get a better understanding of the behavior

@@ -71,6 +72,7 @@ impl StreamerReceiveStats {
packet_batches_count: AtomicUsize::default(),
full_packet_batches_count: AtomicUsize::default(),
max_channel_len: AtomicUsize::default(),
num_packets_dropped: AtomicUsize::default(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note: is it time to start reporting StreamerReceiveStats? or is there a reason we don't actually report these stats.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly include stat reporting in this PR.

As far as I can tell, when the stats component of streamer was added, reporting of those stats was not added to all usage sites, presumably to satisfy type checking of that change without making assumptions about reporting. I.e., I don't think we're deliberately not reporting stats

@gregcusack gregcusack self-requested a review March 1, 2025 06:03
@cpubot cpubot force-pushed the bound-gossip-channels branch from 9c85fbd to cb9bae8 Compare March 1, 2025 19:08
@sakridge
Copy link

sakridge commented Mar 3, 2025

+1 from me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants