
[NEW] Prevent endless replication loop by throttling write commands #1649

Open
lschmidtcavalcante-sc opened this issue Jan 30, 2025 · 23 comments

Comments

@lschmidtcavalcante-sc

lschmidtcavalcante-sc commented Jan 30, 2025

The problem/use-case that the feature addresses

The endless replication loop problem is described at https://redis.io/blog/the-endless-redis-replication-loop-what-why-and-how-to-solve-it/, but it can be summarized as follows: when the replica is slow to consume the replication buffer (for example, because it is busy loading the initial snapshot of the database), that buffer can overflow, triggering the synchronization process to start from scratch again.

Description of the feature

Implement a feature flag (say, write-throttling) that, once enabled, evaluates each incoming write request:

  • If the incoming write request would overflow the replication buffer, drop it and return a THROTTLED error;
  • If the incoming write request would not overflow the replication buffer, append it to the replication buffer and proceed as usual.

This way the endless replication loop is avoided, at the cost of dropping some requests.
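
A minimal sketch of that check, assuming hypothetical config fields and helper names (nothing below is an existing Valkey API apart from addReplyError and the client type; the real implementation would hook into the command-processing path):

    /* Hedged sketch: write_throttling_enabled, repl_buffer_limit_bytes,
     * replBufferUsedBytes, estimateReplPayloadBytes and isWriteCommand are
     * hypothetical placeholders, not existing Valkey identifiers. */
    static int shouldThrottleWrite(client *c) {
        if (!server.write_throttling_enabled) return 0;   /* feature flag is off */
        size_t queued = replBufferUsedBytes();             /* bytes already pending for replicas */
        size_t incoming = estimateReplPayloadBytes(c);     /* bytes this command would append */
        return queued + incoming > server.repl_buffer_limit_bytes;
    }

    /* In the command-processing path, before propagating the write: */
    if (isWriteCommand && shouldThrottleWrite(c)) {
        addReplyError(c, "THROTTLED replication buffer is full, retry later");
        return C_OK;  /* drop the command; nothing is appended to the replication buffer */
    }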

Alternatives you've considered

There are currently two workarounds for the endless replication loop:

  1. Reduce the traffic so the buffer doesn't overflow (either cut the traffic or wait for non-peak hours)
  2. Increase the replication buffer size, giving more time for the replica to catch up.

These workarounds will continue to exist, and the proposed write-throttling is not expected to be enabled by default.
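
For context, the second workaround maps onto existing configuration directives; a hedged example, with purely illustrative values:

    # Give the replica more headroom during full sync by raising the replica
    # output-buffer limits (hard limit, soft limit, soft seconds).
    client-output-buffer-limit replica 1gb 512mb 120

    # Optionally grow the replication backlog used for partial resync, so a
    # briefly lagging replica can catch up without restarting a full sync.
    repl-backlog-size 256mb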

@hpatro
Collaborator

hpatro commented Jan 30, 2025

@lschmidtcavalcante-sc Thanks for starting this issue. I think operationally it makes sense to introduce a mechanism to avoid the endless replication loop. I have a few questions on this.

  • Do you envision this as a boolean feature flag?
  • Could it cause a complete write outage? (This might be a bad/unexpected experience for unfamiliar users.)
  • Couldn't the client be blocked until the full sync is over, instead of introducing a new THROTTLED response? Clients are generally very slow to catch up with these changes.

@lschmidtcavalcante-sc
Author

  1. Yes, it would be a boolean flag.
  2. I can only envision this scenario happening if there is a slow replica that doesn't consume its replication buffer and also never disconnects. Nonetheless, I don't expect this feature to be enabled by default.
  3. By client, do you mean the caller of Valkey? If so, then this solution is good because it doesn't rely on how the caller reacts to THROTTLED (or another error stating that we are close to being full); we are simply not going to write their requests to the replication buffer, regardless of their retries.

@ovaiskhansc

@hpatro Just wanted to add to point #2 above that we use this functionality when Valkey is used as a cache, and in that context a write failure is preferred over an OOM crash.

@hpatro
Collaborator

hpatro commented Jan 31, 2025

  1. By client, do you mean the caller of Valkey? If so, then this solution is good because it doesn't rely on how the caller reacts to THROTTLED (or another error stating that we are close to being full); we are simply not going to write their requests to the replication buffer, regardless of their retries.

Yeah, the blocking mechanism already exists in the system, so we could leverage that.

@lschmidtcavalcante-sc
Author

By "this solution is good" I meant this proposal (write-throttling), as it doesn't rely on the client being well-behaved or not.

@hpatro
Collaborator

hpatro commented Jan 31, 2025

By "this solution is good" I meant this proposal (write-throttling), as it doesn't rely on the client being well-behaved or not.

That's right, but at the same time we don't want to make changes where the client is unable to handle the server response.

@lschmidtcavalcante-sc
Author

lschmidtcavalcante-sc commented Jan 31, 2025

Can you point me to the implementation of this client blocking mechanism (both server and client side)?

PS: Also, it is my goal to contribute CPU throttling to Valkey, and it would reuse this THROTTLED error.

@hpatro
Collaborator

hpatro commented Jan 31, 2025

Can you point me to the implementation of this client blocking mechanism (both server and client side)?

I couldn't find any documentation for it, but here are the code links.
Server Side:
https://github.com/valkey-io/valkey/blob/unstable/src/server.h#L357-L368
https://github.com/valkey-io/valkey/blob/unstable/src/blocked.c

There isn't much to do on the client side.
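
For illustration only, reusing that machinery for this feature could look roughly like the following; BLOCKED_REPL_BACKLOG is a made-up block type, and the sketch assumes the generic blockClient() entry point in blocked.c:

    /* Hypothetical: park the writing client until the replication buffer
     * drains, instead of replying with an error. BLOCKED_REPL_BACKLOG does
     * not exist today; it would have to be added next to the other
     * BLOCKED_* types in server.h. */
    blockClient(c, BLOCKED_REPL_BACKLOG);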

PS: Also, it is my goal to contribute CPU throttling to Valkey, and it would reuse this THROTTLED error.

Sure. I haven't wrapped my head around the THROTTLED error yet, but I presume you're proposing something along the lines of HTTP Retry-After: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429.

@madolson
Member

madolson commented Feb 1, 2025

Sort of a duplicate of #872, but this one has some more traction, so we can maybe use this one moving forward. (Most of the discussion is actually here though: redis/redis#12844)

I overall like this idea. As was mentioned, blocking is one such option, but it doesn't work well for caching workloads. Then again, caching workloads typically do mostly reads and not writes, so blocking the client may be a better approach here. We might independently want to add a new throttled error code that clients can opt in to handling.

@madolson madolson added this to Roadmap Feb 1, 2025
@madolson madolson moved this to Idea in Roadmap Feb 1, 2025
@ovaiskhansc

@madolson we already have this code in our branch and @lschmidtcavalcante-sc can send a proposed PR shortly if we are aligned on the new optional error code approach.

@madolson
Member

madolson commented Feb 4, 2025

Yeah, please feel free to open the PR as a draft (it doesn't need to be perfect), since I think that is a good place to finalize the discussion. I think we'll want to discuss the client compatibility story as well.

@valkey-io/core-team Would also like your input about adding a new throttled or blocking code instead of disconnecting the replica*.

@zuiderkwast
Contributor

I like the idea of throttling. If we make the config an enum instead of a bool, we leave it open to add blocking (like CLIENT PAUSE WRITE) later too.

Other uses of throttling could be CPU or the number of pending commands to handle.

@lschmidtcavalcante-sc
Author

Shared proposed implementation at #1672

@lschmidtcavalcante-sc
Author

I like the idea of throttling. If we make the config an enum instead of a bool, we leave it open to add blocking (like CLIENT PAUSE WRITE) later too.

Other uses of throttling could be CPU or the number of pending commands to handle.

For CPU throttling I started a new issue at #1688

@ovaiskhansc

Shared proposed implementation at #1672

@madolson @zuiderkwast @hpatro What do you think about #1672, which adds a new "-THROTTLED" error to prevent primary node slowdowns and allows configuring caching clusters to throttle instead of fail?

@madolson
Member

@ovaiskhansc We're in the process of releasing the 8.1 release candidate, so we're a bit distracted by that. Once that's out the door we'll be a little more responsive on the PR. Appreciate the patience!

@ovaiskhansc

No worries at all. 8.1 is an important release for us, as we are planning to deploy it (+ these protections) at Snap.

@PingXie
Member

PingXie commented Feb 17, 2025

I am on board with this proposal.

Generally speaking, I think blocking clients on replication backlog overflow is a breaking change. The convention has been to encode blocking behavior explicitly in the command, such as BLPOP. Returning -THROTTLED would be more compatible, IMO. Blocking could be opted into on a per-client basis via CLIENT CAPA, however.
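
As an illustration of the per-client opt-in, the handshake might look like this (the throttled capability name is hypothetical and only sketches the idea):

    CLIENT CAPA throttled
    +OK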

@hpatro
Collaborator

hpatro commented Feb 17, 2025

My two cents: CLIENT CAPA shouldn't be overused, otherwise it will become a nightmare pretty quickly (on the server side with branching, and on the client side to figure out which behavior is supported by a given version of the Valkey client library). I think we should pick a behavior for the client and enforce it from a major version. Blocking is one of the safest options (if you consider it a breaking change), since there is no additional new behavior to handle: clients could time out and retry like they would have done earlier. However, this could cause a connection storm if all the clients get disconnected.

@zuiderkwast
Contributor

I like -THROTTLED. We can maybe refine the exact error message later. I think the word throttle can mean to slow down, so maybe rate limiting is a better term? We could also encode some message about how long to wait before trying again: -BACKOFF 3 seconds please. (Not literally this, but you get the idea.) Alternatively, we can just document that clients should wait at least X seconds before retrying; they shouldn't retry immediately.

I think we need a config and a CLIENT CAPA.

If we only throttle clients that report that they support throttling, we can assume that they behave correctly and don't immediately resend the commands. They should wait some time before resending, or delegate the decision to the application code behind them. If it's a web API, it can return 429 Too Many Requests back to the HTTP client. (That's the HTTP equivalent of -THROTTLED.)

Blocking, however, is quite bad in a real-time application with latency requirements, and even in, say, a website or mobile app, you wouldn't want things to just hang. An error and retrying later is much better; then the application can handle it in a suitable way, either delaying and retrying or pushing the throttling back to the end user.

Another reason not to block/delay commands is that it can make commands queue up. Subsequent non-write commands sent on the same connection (i.e., pipelined) get blocked too, the client buffers grow, client eviction kicks in... It doesn't seem great to me.
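
For concreteness, a hypothetical throttled reply with a retry hint could look like this on the RESP wire; the exact error code and wording are still open:

    -THROTTLED replication buffer full, retry after 3 seconds\r\n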

@zuiderkwast
Contributor

@hpatro you have a point about CLIENT CAPA, though. I think we can require that any 9.0-compatible client has to handle -THROTTLED, though it's fine to just return it back to the caller. A new CAPA just for returning an error back to the caller would be a bit weird, I think. Actually, I'd expect this to already be the default client behavior for any unexpected error reply.

@hpatro
Collaborator

hpatro commented Feb 17, 2025

I also don't mind some mechanism to indicate some form of backoff; I had put up that thought in the PR (#1672 (comment)). Yeah, I was trying to highlight that a CAPA for an individual error would be a bit too much.

@madolson
Member

madolson commented Feb 17, 2025

Generally speaking, I think blocking clients on replication backlog overflow is a breaking change.

Not really sure how this is a breaking change. From the client side it just looks like the server is slow.

Maybe it's also worth mentioning that, for short bursts, the blocking time won't be that long. If we temporarily exceed the limit, we will start blocking until we can drain the replica buffer.

Yeah, I was trying to highlight that a CAPA for an individual error would be a bit too much.

I think we need a better strategy, though. It's important to remember that devops folks are not the same as application developers. If an application developer is not prepared to handle this new THROTTLED error, they are going to have issues with their application and will likely retry (as was mentioned), even if the devops folks want the feature. Another approach might be for the throttling config to support three modes:

  1. Never return THROTTLED; disconnect the client.
  2. Respect the client capability: return THROTTLED if it is supported, but if it's not, disconnect the replica.
  3. Prefer keeping the replica in sync: if the client supports it, return THROTTLED; otherwise block or disconnect the client.

This gives operators and managed services a softer lever. Clients can opt in by sending the capability but aren't required to do so.
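
A hedged sketch of what such a three-mode config could look like; the directive and mode names below are hypothetical and only mirror the list above:

    # Hypothetical directive; nothing here exists in Valkey today.
    #   disconnect-client : never return THROTTLED; disconnect the client
    #   respect-capa      : return THROTTLED to capable clients, otherwise disconnect the replica
    #   prefer-replica    : return THROTTLED to capable clients, otherwise block or disconnect the client
    repl-write-throttling-policy respect-capa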

I also really don't like error modes that only surface during other failure modes; you don't want to experience a new error only when you are under memory pressure. One mitigation could be to require clients to support this capability when connecting.
