Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save to Disk in Bio Thread - draft #1784

Draft
wants to merge 1 commit into
base: unstable
Choose a base branch
from

Conversation

nitaicaro
Copy link
Contributor

@nitaicaro nitaicaro commented Feb 26, 2025

Introduction

This PR introduces a new feature that enables replicas to perform disk-based synchronization on a dedicated background thread (Bio thread). Benchmarking results demonstrate significant improvements in synchronization duration. In extreme cases, this optimization allows syncs that would have previously failed to succeed.

This is an early draft pull request, as requested by the maintainers, to allow for review of the overall structure and approach before the full implementation is completed.

Problem Statement

Some administrators prefer the disk-based full synchronization mode for replicas. This mode allows replicas to continue serving clients with data while downloading the RDB file.

Valkey's predominantly single-threaded nature creates a challenge: serving client read requests and saving data from the socket to disk are not truly concurrent operations. In practice, the replica alternates between processing client requests and replication data, leading to inefficient behavior and prolonged sync durations, especially under high load.

Proposed Solution

To address this, the solution offloads the task of downloading the RDB file from the socket to a background thread. This allows the main thread to focus exclusively on handling client read requests while the background thread handles communication with the primary.

Benchmarking Results

Potential for Improvement

In theory, this optimization can lead to unbounded improvement in sync duration. By eliminating competition between client read events and socket communication (i.e., events related to handling RDB download with the primary), sync times become independent on load - the main thread handles only client reads, while the background thread focuses on primary RDB download events, allowing the system to perform consistently even under high load.

The full valkey-benchmark commands can be found in the appendix below.

Sync Duration with Feature Disabled (times in seconds)

16 threads, 64 clients: 172 seconds
32 threads, 128 clients: 436 seconds
48 threads, 192 clients: 710 seconds

Sync Duration with Feature Enabled (times in seconds)

16 threads, 64 clients: 33 seconds (80.8% improvement)
32 threads, 128 clients: 33 seconds (92.4% improvement)
48 threads, 192 clients: 33 seconds (95.3% improvement)

image

Alternative Solutions Considered

IO Threads
IO threads to not have an advantage over Bio in this case: The save-to-disk job is rare (most likely no more than several executions in a replica's lifetime), and there is never more than one simultaneous execution. Bio threads make more sense for a single, slow long running operation.

io_uring
For a single connection, io_uring doesn't provide as much of a performance boost because the primary advantage comes from batching many I/O operations together to reduce syscall overhead. With just one connection, we won't have enough operations to benefit significantly from these optimizations.

Prioritizing primary's socket in the event loop
This approach would help, but less effectively than using a Bio thread. We would still need to allocate attention to handling read requests, which could limit its benefit. It could be more useful on smaller instance types with limited CPU cores.

Appendix:

Benchmarking Setup

  • Client machine: AWS c5a.16xlarge
  • Server machines: AWS c5a.2xlarge
# Step 1: Fill the primary and replica DBs with 6GB of data:

./valkey-benchmark -h <host> -p <port> -l -d 128 -t set -r 30000000 --threads 16 -c 64

# Step 2: Initiate heavy read load on the replica:

./valkey-benchmark -h <host> -p <port> -t get -r 30000000 --threads <t> -c <t> -n 1000000000 -P <P>

# Step 3: Enable/disable the config controlling the new feature:

./valkey-cli -h <host> -p <port> config set replica-save-to-disk-in-bio-thread <yes/no>

# Step 4: Initiate sync:

./valkey-cli -h <replica host> -p <replica port> replicaof <primary host> <primary port>

@xbasel xbasel self-requested a review February 26, 2025 10:34
@nitaicaro nitaicaro changed the title save to disk in bio thread - draft Save to Disk in Bio Thread - draft Feb 26, 2025
bioSubmitJob(BIO_SAVE_TO_DISK, job);
}

long long cancelReplicationHandshakeTimeEvent(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to replace these with "done handlers" in the replication cron.

Copy link

codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 83.64780% with 26 lines in your changes missing coverage. Please review.

Project coverage is 71.16%. Comparing base (b360f96) to head (7233cde).
Report is 1 commits behind head on unstable.

Files with missing lines Patch % Lines
src/bio.c 84.21% 15 Missing ⚠️
src/replication.c 82.81% 11 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1784      +/-   ##
============================================
+ Coverage     71.12%   71.16%   +0.03%     
============================================
  Files           123      123              
  Lines         65610    65775     +165     
============================================
+ Hits          46664    46807     +143     
- Misses        18946    18968      +22     
Files with missing lines Coverage Δ
src/config.c 78.39% <ø> (ø)
src/server.h 100.00% <ø> (ø)
src/replication.c 86.08% <82.81%> (-0.94%) ⬇️
src/bio.c 84.34% <84.21%> (+0.64%) ⬆️

... and 15 files with indirect coverage changes

@nitaicaro nitaicaro force-pushed the replica-save-to-disk-in-bio-thread branch from 683b81e to 7233cde Compare February 26, 2025 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant