Implementation of CPU throttling #1689
base: unstable
Conversation
Signed-off-by: Lucas Schmidt Cavalcante <[email protected]>
Thanks for the contribution, @lschmidtcavalcante-sc! This is a very valid problem to solve for clusters, but I don't think we should back-pressure on CPU usage, which doesn't scale linearly with the workload level. The main challenges that I can think of, ranked by their impact and likelihood, are:
Today we can only safely time out a read-only Lua script, but as soon as the script modifies a key, it has to run to completion. Aborting a write script safely would require applying changes in an all-or-none way. Along this line, long-running non-script commands (with O(N) complexity) also risk starving the cluster bus, leading to premature node failures. We could consider a generic solution to either time them out or time-slice their execution with the cluster bus processing. However, the latter could break consistency, since the cluster bus can churn the topology mid-execution.
We could explore limiting the […]
On a related note, I do think back-pressuring on memory stress is more valuable to reduce the OOM risk.
Hey @PingXie, let me give you more context so we are addressing the same issue. At Snap we observed that from time to time a team launches a new feature that dramatically increases the amount of traffic to a given cluster, but they also forget to scale it up; then, if CPU usage remains high for a sustained period, some nodes are lost. It is still early to tell how effective CPU throttling has been for Snap (also, we are still using KeyDB), but our internal testing indicated that clusters under heavy load with CPU throttling enabled did not lose nodes, giving extra time for the oncall to take action. To address the comment at #1688 (comment) by @xbasel: our internal testing did not assume any handling from the clients.
I have the following questions: […]
@artikell, what do you mean by "throttled requests have fully occupied the CPU"? Regarding the introduction of a new error code, we intend to re-use this one: #1672. I don't think we had any issues at Snap introducing a new error code (@ovaiskhansc can correct me). We also don't control which client people use to access our clusters, nor does this solution expect any handling from the client side. Keep in mind that this is a feature behind a flag that I expect to be disabled by default.
Thanks for the context. I think this falls under the second scenario. I don't think the proposed solution would be effective. By the time we get to […]
Proposed implementation of CPU throttling (still a work in progress).

The implementation adds a new config, `cpu_overload_protection_threshold`: if set to 0, CPU throttling is disabled; otherwise, when CPU usage is above the threshold, a percentage of the requests (`cpu_overload_protection_throttling_percentage`) is throttled. The percentage of requests to be throttled (responded to with `THROTTLED`) is adjusted every 10 seconds.

Note that some commands are never throttled (see `is_client_excluded_from_cpu_throttling` and `is_cmd_excluded_from_cpu_throttling`): they are mostly commands related to intra-cluster communication; after all, we want to be able to scale up the cluster to move away from this state.

Issue #1688