
Implementation of CPU throttling #1689

Open
wants to merge 1 commit into base: unstable

Conversation

@lschmidtcavalcante-sc commented Feb 7, 2025

Proposed implementation of CPU throttling (still a work in progress).

The implementation adds a new config, cpu_overload_protection_threshold: if set to 0, CPU throttling is disabled; otherwise, a percentage of requests (cpu_overload_protection_throttling_percentage) is throttled whenever CPU usage is above that threshold.

The percentage of requests to be throttled (responded to with THROTTLED) is adjusted every 10 seconds (a rough sketch of this adjustment follows the list below):

  • If CPU usage is above the threshold, double the percentage of requests to be throttled (starting at 1);
  • Otherwise, halve it (eventually reaching zero).
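
For illustration, here is a minimal sketch of what that 10-second adjustment loop could look like, assuming a cron-style timer calls it periodically. The config name comes from this PR's description; get_cpu_usage_percent() and the variable names are hypothetical placeholders, not actual Valkey internals:

```c
/* Sketch of the 10-second adjustment described above.
 * Assumed to be invoked by a cron-style timer every 10 seconds. */
static int cpu_overload_protection_threshold = 80;  /* config; 0 disables throttling */
static int throttling_percentage = 0;               /* % of requests answered with THROTTLED */

extern double get_cpu_usage_percent(void);          /* assumed helper: CPU usage over the last window */

void adjust_cpu_throttling(void) {
    if (cpu_overload_protection_threshold == 0) {   /* feature disabled */
        throttling_percentage = 0;
        return;
    }
    if (get_cpu_usage_percent() > cpu_overload_protection_threshold) {
        /* Above the threshold: double the throttled percentage, starting at 1. */
        throttling_percentage = throttling_percentage ? throttling_percentage * 2 : 1;
        if (throttling_percentage > 100) throttling_percentage = 100;
    } else {
        /* Below the threshold: halve it, eventually reaching zero. */
        throttling_percentage /= 2;
    }
}
```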

Note that some commands are never throttled (see is_client_excluded_from_cpu_throttling and is_cmd_excluded_from_cpu_throttling): they are mostly commands related to intra-cluster communication; after all, we want to be able to scale up the cluster to move away from this state.
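
As a sketch of how the per-request decision could be wired in: the two exclusion helpers are the ones named above, but their use here, should_throttle_request(), and the rand()-based sampling are assumptions for illustration, not code taken from this PR:

```c
#include <stdlib.h>
#include <stdbool.h>

extern int throttling_percentage;   /* maintained by the 10-second adjustment loop */
extern bool is_client_excluded_from_cpu_throttling(const void *client);
extern bool is_cmd_excluded_from_cpu_throttling(const void *cmd);

/* Decide whether this request should be answered with THROTTLED. */
bool should_throttle_request(const void *client, const void *cmd) {
    if (throttling_percentage == 0) return false;   /* throttling currently inactive */
    /* Never throttle intra-cluster traffic, so the cluster can still be scaled up. */
    if (is_client_excluded_from_cpu_throttling(client)) return false;
    if (is_cmd_excluded_from_cpu_throttling(cmd)) return false;
    /* Throttle roughly throttling_percentage percent of the remaining requests. */
    return (rand() % 100) < throttling_percentage;
}
```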

Issue #1688

Signed-off-by: Lucas Schmidt Cavalcante <[email protected]>
@PingXie (Member) commented Feb 8, 2025

Thanks for the contribution, @lschmidtcavalcante-sc!

This is a very valid problem to solve for clusters, but I don't think we should back-pressure on CPU usage, which does not scale linearly with the workload level. The main challenges I can think of, ranked by their impact and likelihood, are:

  1. Long running Lua scripts block the main thread.

Today we can only safely time out a read-only Lua script, but as soon as the script modifies a key, it has to run to completion. Aborting a write script safely would require applying its changes in an all-or-none way.

Along this line, long-running non-script commands (with O(N) complexity) also risk starving the cluster bus, leading to premature node failures. We could consider a generic solution to either time them out or time-slice their execution with cluster bus processing. However, the latter could break consistency, since the cluster bus can churn the topology mid-execution.

  2. An excessive number of active clients delays cluster connection processing.

We could explore limiting the epoll processing batch size and additionally prioritizing the processing of cluster connections. Rather than competing against 10,000 active clients, the cluster connections would then share CPU time with, say, 1,000 clients in every batch.

On a related note, I do think back-pressuring on memory stress is more valuable to reduce the OOM risk.

@lschmidtcavalcante-sc (Author) commented Feb 10, 2025

Hey @PingXie, let me give you more context so we are addressing the same issue. At Snap we observed that, from time to time, a team launches a new feature that dramatically increases the traffic to a given cluster but forgets to scale it up; if CPU usage then remains high for a sustained period, some nodes are lost.

It is still early to tell how effective CPU throttling has been for Snap (also, we are still using KeyDB), but our internal testing indicated that clusters under heavy load with CPU throttling enabled did not lose nodes, giving the on-call extra time to take action. To address the comment at #1688 (comment) by @xbasel: our internal testing did not assume any handling from the clients.

@artikell (Contributor) commented Feb 14, 2025

I have the following questions:

  • What should be done if the throttled requests have fully occupied the CPU?
  • Is adding a new error message unfriendly to users? Why not consider blocking the requests instead, similar to isPausedActions?
  • In the avalanche scenario, is it very likely that the CPU will be fully occupied continuously until the service becomes unavailable?

@lschmidtcavalcante-sc (Author) commented Feb 14, 2025

@artikell, what do you mean by throttled requests having fully occupied the CPU? Regarding the introduction of a new error code, we intend to re-use the one from #1672, and I don't think we had any issues at Snap introducing a new error code (@ovaiskhansc can correct me). We also don't control which client people use to access our clusters, nor does this solution expect any handling from the client side.

Keep in mind that this is a feature behind a flag that I expect to be disabled by default.

@PingXie (Member) commented Feb 18, 2025

let me give you more context so we are addressing the same issue. At Snap we observed that from time to time a team launches a new feature that dramatically increases the amount of traffic to a given cluster but they also forget to scale it up, then if CPU usage remains high for a sustained period some nodes are lost.

Thanks for the context. I think this falls under the second scenario. I don't think the proposed solution would be effective: by the time we get to processCommand, a significant amount of work has already been done (epoll, read, command parsing, etc.). Throwing away the command at this stage is quite wasteful, and I doubt it would help. The knob itself is also not very intuitive and can be hard to tune on different platforms.

I would highly recommend introducing a proper QoS (Quality of Service) model where cluster traffic is prioritized over customer traffic. For instance, we could use two separate epoll file descriptors, one for just the cluster connections and the other for all other connections, including both normal and replica clients. We would then always poll the cluster epfd first and process all of its active connections. For the other epfd, we could set a limit, say the maximum number of active connections processed in one batch. We could also limit the total CPU time spent on processing the active connections returned by one epoll of the normal epfd (a rough sketch follows below).

This CPU time limit doesn't need to be super accurate. A lower limit just means we give the cluster traffic a bit more time slice, but it won't end up increasing the failure rate for the user traffic.
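
To make the idea concrete, here is a rough sketch of one event-loop iteration under this two-epoll model, assuming Linux epoll. The descriptor names, handle_connection(), and the constants are illustrative placeholders (and the zero epoll timeouts simplify away the normal blocking wait), not actual Valkey code:

```c
#include <sys/epoll.h>
#include <time.h>

#define NORMAL_BATCH_LIMIT 1000   /* max normal/replica connections handled per iteration */
#define NORMAL_CPU_BUDGET_MS 20   /* soft CPU-time budget for normal traffic per iteration */

extern int cluster_epfd, normal_epfd;                  /* two separate epoll instances */
extern void handle_connection(struct epoll_event *ev); /* assumed per-connection handler */

static long long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
}

void event_loop_iteration(void) {
    struct epoll_event events[NORMAL_BATCH_LIMIT];

    /* 1. Poll the cluster epfd first; cluster traffic gets no batch cap or CPU budget,
     *    so it is never starved by a flood of client connections. */
    int n = epoll_wait(cluster_epfd, events, NORMAL_BATCH_LIMIT, 0);
    for (int i = 0; i < n; i++) handle_connection(&events[i]);

    /* 2. Normal and replica clients next, capped both by batch size and by a coarse
     *    CPU-time budget; any remaining ready connections are picked up next iteration. */
    long long deadline = now_ms() + NORMAL_CPU_BUDGET_MS;
    n = epoll_wait(normal_epfd, events, NORMAL_BATCH_LIMIT, 0);
    for (int i = 0; i < n && now_ms() < deadline; i++) handle_connection(&events[i]);
}
```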
