
Implementation of CPU throttling #1689

Open
wants to merge 1 commit into base: unstable

Conversation

@lschmidtcavalcante-sc commented Feb 7, 2025

Proposed implementation of CPU throttling (still a work in progress).

The implementation adds a new config, cpu_overload_protection_threshold: if set to 0, CPU throttling is disabled; otherwise, a percentage of requests (cpu_overload_protection_throttling_percentage) is throttled whenever CPU usage is above that threshold.

The percentage of requests to be throttled (responded to with THROTTLED) is adjusted every 10 seconds (a rough sketch of this adjustment follows the list below):

  • If CPU usage is above the threshold, double the percentage of requests to be throttled (starting at 1);
  • Otherwise, halve it (eventually reaching zero).
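
For illustration, here is a minimal sketch of what that 10-second adjustment loop could look like, assuming a cron-style timer calls it periodically. The config name comes from this PR's description; get_cpu_usage_percent() and the variable names are hypothetical placeholders, not actual Valkey internals:

```c
/* Sketch of the 10-second adjustment described above.
 * Assumed to be invoked by a cron-style timer every 10 seconds. */
static int cpu_overload_protection_threshold = 80;  /* config; 0 disables throttling */
static int throttling_percentage = 0;               /* % of requests answered with THROTTLED */

extern double get_cpu_usage_percent(void);          /* assumed helper: CPU usage over the last window */

void adjust_cpu_throttling(void) {
    if (cpu_overload_protection_threshold == 0) {   /* feature disabled */
        throttling_percentage = 0;
        return;
    }
    if (get_cpu_usage_percent() > cpu_overload_protection_threshold) {
        /* Above the threshold: double the throttled percentage, starting at 1. */
        throttling_percentage = throttling_percentage ? throttling_percentage * 2 : 1;
        if (throttling_percentage > 100) throttling_percentage = 100;
    } else {
        /* Below the threshold: halve it, eventually reaching zero. */
        throttling_percentage /= 2;
    }
}
```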

Note that some commands are never throttled (see is_client_excluded_from_cpu_throttling and is_cmd_excluded_from_cpu_throttling): they are mostly commands related to intra-cluster communication; after all, we want to be able to scale up the cluster to move away from this state.
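
As a sketch of how the per-request decision could be wired in: the two exclusion helpers are the ones named above, but their use here, should_throttle_request(), and the rand()-based sampling are assumptions for illustration, not code taken from this PR:

```c
#include <stdlib.h>
#include <stdbool.h>

extern int throttling_percentage;   /* maintained by the 10-second adjustment loop */
extern bool is_client_excluded_from_cpu_throttling(const void *client);
extern bool is_cmd_excluded_from_cpu_throttling(const void *cmd);

/* Decide whether this request should be answered with THROTTLED. */
bool should_throttle_request(const void *client, const void *cmd) {
    if (throttling_percentage == 0) return false;   /* throttling currently inactive */
    /* Never throttle intra-cluster traffic, so the cluster can still be scaled up. */
    if (is_client_excluded_from_cpu_throttling(client)) return false;
    if (is_cmd_excluded_from_cpu_throttling(cmd)) return false;
    /* Throttle roughly throttling_percentage percent of the remaining requests. */
    return (rand() % 100) < throttling_percentage;
}
```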

Issue #1688

Signed-off-by: Lucas Schmidt Cavalcante <[email protected]>
@PingXie (Member) commented Feb 8, 2025

Thanks for the contribution, @lschmidtcavalcante-sc!

This is a very valid problem to solve for clusters, but I don't think we should back-pressure on CPU usage, which does not scale linearly with the workload level. The main challenges I can think of, ranked by their impact and likelihood, are:

  1. Long running Lua scripts block the main thread.

Today we can only safely time out a read-only Lua script, but as soon as the script modifies a key, it has to run to completion. Aborting a write script safely would require applying its changes in an all-or-none way.

Along this line, long-running non-script commands (with O(N) complexity) also risk starving the cluster bus, leading to premature node failures. We could consider a generic solution to either time them out or time-slice their execution with cluster bus processing. However, the latter could break consistency, since the cluster bus can churn the topology mid-execution.

  2. An excessive number of active clients delays cluster connection processing.

We could explore limiting the epoll processing batch size and additionally prioritizing the processing of cluster connections. Rather than competing against 10,000 active clients, the cluster connections would then share CPU time with, say, 1,000 clients in every batch.

On a related note, I do think back-pressuring on memory stress is more valuable to reduce the OOM risk.

@lschmidtcavalcante-sc (Author) commented Feb 10, 2025

Hey @PingXie, let me give you more context so we are addressing the same issue. At Snap we observed that, from time to time, a team launches a new feature that dramatically increases the traffic to a given cluster but forgets to scale it up; if CPU usage then remains high for a sustained period, some nodes are lost.

It is still early to tell how effective CPU throttling has been for Snap (also, we are still using KeyDB), but our internal testing indicated that clusters under heavy load with CPU throttling enabled did not lose nodes, giving the on-call extra time to take action. To address the comment at #1688 (comment) by @xbasel: our internal testing did not assume any handling from the clients.

@artikell (Contributor) commented Feb 14, 2025

I have the following questions:

  • What should be done if the throttled requests have fully occupied the CPU?
  • Is adding a new error message unfriendly to users? Why not consider blocking the requests instead, similar to isPausedActions?
  • In the avalanche scenario, is it very likely that the CPU will be fully occupied continuously until the service becomes unavailable?

@lschmidtcavalcante-sc (Author) commented Feb 14, 2025

@artikell, what do you mean by throttled requests having fully occupied the CPU? Regarding the introduction of a new error code, we intend to re-use the one from #1672, and I don't think we had any issues at Snap introducing a new error code (@ovaiskhansc can correct me). We also don't control which client people use to access our clusters, nor does this solution expect any handling from the client side.

Keep in mind that this is a feature behind a flag that I expect to be disabled by default.

@PingXie (Member) commented Feb 18, 2025

let me give you more context so we are addressing the same issue. At Snap we observed that from time to time a team launches a new feature that dramatically increases the amount of traffic to a given cluster but they also forget to scale it up, then if CPU usage remains high for a sustained period some nodes are lost.

Thanks for the context. I think this falls under the second scenario. I don't think the proposed solution would be effective: by the time we get to processCommand, a significant amount of work has already been done (epoll, read, command parsing, etc.). Throwing away the command at this stage is quite wasteful, and I doubt it would help. The knob itself is also not very intuitive and can be hard to tune on different platforms.

I would highly recommend introducing a proper QoS (Quality of Service) model where cluster traffic is prioritized over customer traffic. For instance, we could use two separate epoll file descriptors, one for just the cluster connections and the other for all other connections, including both normal and replica clients. We would then always poll the cluster epfd first and process all of its active connections. For the other epfd, we could set a limit, say the maximum number of active connections processed in one batch. We could also limit the total CPU time spent on processing the active connections returned by one epoll of the normal epfd (a rough sketch follows below).

This CPU time limit doesn't need to be super accurate. A lower limit just means we give the cluster traffic a bit more time slice, but it won't end up increasing the failure rate for the user traffic.
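
To make the idea concrete, here is a rough sketch of one event-loop iteration under this two-epoll model, assuming Linux epoll. The descriptor names, handle_connection(), and the constants are illustrative placeholders (and the zero epoll timeouts simplify away the normal blocking wait), not actual Valkey code:

```c
#include <sys/epoll.h>
#include <time.h>

#define NORMAL_BATCH_LIMIT 1000   /* max normal/replica connections handled per iteration */
#define NORMAL_CPU_BUDGET_MS 20   /* soft CPU-time budget for normal traffic per iteration */

extern int cluster_epfd, normal_epfd;                  /* two separate epoll instances */
extern void handle_connection(struct epoll_event *ev); /* assumed per-connection handler */

static long long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
}

void event_loop_iteration(void) {
    struct epoll_event events[NORMAL_BATCH_LIMIT];

    /* 1. Poll the cluster epfd first; cluster traffic gets no batch cap or CPU budget,
     *    so it is never starved by a flood of client connections. */
    int n = epoll_wait(cluster_epfd, events, NORMAL_BATCH_LIMIT, 0);
    for (int i = 0; i < n; i++) handle_connection(&events[i]);

    /* 2. Normal and replica clients next, capped both by batch size and by a coarse
     *    CPU-time budget; any remaining ready connections are picked up next iteration. */
    long long deadline = now_ms() + NORMAL_CPU_BUDGET_MS;
    n = epoll_wait(normal_epfd, events, NORMAL_BATCH_LIMIT, 0);
    for (int i = 0; i < n && now_ms() < deadline; i++) handle_connection(&events[i]);
}
```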
