Clustering feedback #112
Hi, thanks for the excellent error report. This is definitely a bug in Bottleneck itself.

This is usually a very easy fix; I just need answers to a couple of questions. Will fix ASAP.
Found it. It's a race condition that can only happen at relatively high load or in relatively large clusters. The race condition was already handled in one place, but not in a second one just a few lines later.

Basically, limiter 1 cleaned up an expired job while limiter 2 was still running it. Then limiter 2's timer expires and it goes to mark the job as expired. It tries to compute how much capacity is now available for new jobs, but it can't find the job's data anymore, since limiter 1 already removed it -- hence the arithmetic on a nil value.

I'll release v2.16.1 with this fix later today.
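To make the shape of the race concrete, here is a standalone conceptual sketch in plain JavaScript. It is not Bottleneck's actual internals, and every name in it is invented:

```js
// Conceptual sketch only: two limiters sharing one job store, not Bottleneck's real code.
const sharedJobs = new Map(); // jobId -> { weight }

// Limiter 1: its cleanup pass removes a job it considers expired.
function cleanupExpired(jobId) {
  sharedJobs.delete(jobId);
}

// Limiter 2: its own expiration timer fires for the same job.
function onJobExpired(jobId, capacity) {
  const job = sharedJobs.get(jobId);
  // Without this guard, `job.weight` throws here -- the JS analogue of
  // Lua's "attempt to perform arithmetic on a nil value".
  if (job === undefined) return capacity;
  return capacity + job.weight; // free up the capacity the job was holding
}

sharedJobs.set('job-1', { weight: 1 });
cleanupExpired('job-1');                 // limiter 1 wins the race
console.log(onJobExpired('job-1', 49));  // limiter 2 survives thanks to the guard
```

In Bottleneck itself this logic lives in Redis Lua scripts, where the missing job data shows up as the nil value in the error reported below.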
Please try v2.16.1 and let me know if the problem persists. By the way, you should make sure your job expirations aren't longer than your cluster timeout value; I'll be adding a validation for that in the next release. Out of curiosity, how many servers do you have in that cluster? And other than this error, is it working as expected? Thanks 😄
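For readers following along, the relationship between the two settings looks roughly like the sketch below, using Bottleneck v2's `timeout` limiter option and `expiration` job option. The id, connection details, and numbers are placeholders:

```js
const Bottleneck = require('bottleneck');

// Stub standing in for the real rate-limited API call.
const callRateLimitedApi = async () => 'ok';

const limiter = new Bottleneck({
  id: 'shared-api',                                  // placeholder: must match in every process
  datastore: 'redis',
  clientOptions: { host: '127.0.0.1', port: 6379 },  // placeholder connection details
  timeout: 5 * 60 * 1000,                            // cluster timeout: state is dropped after 5 min of inactivity
  maxConcurrent: 50,
});

// Keep every job's expiration shorter than the cluster timeout above.
limiter.schedule({ expiration: 60 * 1000 }, callRateLimitedApi)
  .then(result => console.log(result))
  .catch(err => console.error(err));
```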
I've just deployed the new version -- so far so good :) I've also increased the cluster timeout value -- you're right that it doesn't make sense, although in practice our cluster is never idle for more than a minute anyway.

We've got one Redis server. As for the processes using Bottleneck, we've got 4 servers, each running roughly 4 Node processes. They're all constantly competing for a rate-limited API resource, which is why we're using Bottleneck.

I think it's working as expected! But I have to admit that it's quite difficult to test. Sometimes it feels like things gradually slow down over the course of days, and restarting the processes or clearing the limiter's Redis keys seems to bring things back to normal.

Thanks for the super quick fix, it's very much appreciated! 👍
This is very valuable feedback, thank you. It's hard to get that kind of clustering feedback since most usage is in commercial/closed-source software. Next time it starts slowing down, could you please look into the following for me?
Tip: the servers will recreate the keys automatically if they're missing. Important: take care when you delete the keys.

Thank you in advance; this will help me figure out if it's a Bottleneck issue that I need to reproduce and fix, or if it's an issue in your system.
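One way to peek at the limiter's state in Redis from Node is sketched below, using ioredis. The `b_*` key pattern is an assumption about Bottleneck's key prefix; check what `KEYS *` shows on your instance and adjust accordingly:

```js
const Redis = require('ioredis');

async function dumpLimiterState() {
  const redis = new Redis({ host: '127.0.0.1', port: 6379 }); // placeholder connection

  // Assumed prefix: Bottleneck's keys are expected to start with "b_".
  const keys = await redis.keys('b_*');
  for (const key of keys) {
    const type = await redis.type(key);
    if (type === 'hash') {
      console.log(key, await redis.hgetall(key));
    } else if (type === 'zset') {
      console.log(key, await redis.zrange(key, 0, -1, 'WITHSCORES'));
    } else {
      console.log(key, type);
    }
  }
  await redis.quit();
}

dumpLimiterState().catch(console.error);
```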
The slowdown you're noticing might be due to clock drift between your servers. When things start slowing down, please take a look and make sure their clocks are all in agreement. I have plans to use the Redis server's clock instead, but that will require Redis 3.2 or higher. Redis 5 recently came out and even 3.2 is no longer supported; however, since you and many other people are still using 2.8, I'm waiting a little bit longer before making this change. I'm still interested to hear more about this slowdown next time it happens -- answers to the questions in my previous post would greatly help.
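A quick way to check for drift is to compare each host's local clock against the Redis server's clock. The sketch below uses ioredis and the Redis `TIME` command; the connection details are placeholders:

```js
const Redis = require('ioredis');

async function checkClockDrift() {
  const redis = new Redis({ host: '127.0.0.1', port: 6379 }); // placeholder connection

  // TIME returns [seconds, microseconds] according to the Redis server's clock.
  const [seconds, microseconds] = await redis.time();
  const redisMs = Number(seconds) * 1000 + Math.round(Number(microseconds) / 1000);
  const driftMs = Date.now() - redisMs;

  console.log(`This host is ${driftMs} ms ahead of the Redis server`);
  await redis.quit();
}

checkClockDrift().catch(console.error);
```

Running it on every host and comparing the numbers would show whether the clocks are diverging from each other.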
It's possible that clocks could be the culprit, although we use the standard Ubuntu NTP setup, which I think should keep things fairly synced up. Besides a potential rewrite of the clock functionality, are there any other reasons why we should upgrade our Redis instance for Bottleneck? Today we're on v2.8 only because that's what came installed by default... As for the slowdown issue, it seems to have been very solid since your last fix.
Not really -- Redis is so reliable there's little reason to upgrade. When I decide to improve the cluster performance and accuracy by making this clock change, users on a version below 3.2 will get a helpful error message upon first connection.
🎉 🎉 🎉
Perfect. I'm going to close this issue; please open another one if needed.
I sometimes get an error thrown in my `limiter.on('error', err => console.log(err))` handler:

`ReplyError: ERR Error running script (call to f_c8e3e22936b90c2d0e919964acaf643f234b6146): @user_script:77: user_script:77: attempt to perform arithmetic on a nil value`

When this error happens, jobs currently `EXECUTING` seem to disappear, and the code using the limiter hangs forever.

This is my setup code:
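As a stand-in sketch rather than the reporter's actual configuration, a clustered Bottleneck setup of the kind described in this thread generally looks like the following; the `id`, connection details, and limits below are placeholder assumptions:

```js
const Bottleneck = require('bottleneck');

// Sketch only -- placeholder values, not the reporter's actual configuration.
const limiter = new Bottleneck({
  id: 'shared-api-limiter',   // same id in every process so they share state in Redis
  datastore: 'redis',
  clientOptions: { host: 'redis.example.com', port: 6379 },
  clearDatastore: false,      // keep existing limiter state on startup
  maxConcurrent: 50,          // the thread mentions a 50-job capacity
});

limiter.on('error', err => console.log(err));
```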
This is how I add jobs:
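Again only as a hedged stand-in for the original snippet, adding jobs typically goes through `schedule()`; `callRateLimitedApi` is a made-up placeholder and `limiter` refers to the instance from the sketch above:

```js
// Placeholder standing in for whatever rate-limited API call is being made.
async function callRateLimitedApi(params) {
  return { ok: true, params };
}

// schedule() queues the job and resolves with its result once it has run.
limiter.schedule(() => callRateLimitedApi({ page: 1 }))
  .then(result => console.log('done', result))
  .catch(err => console.error('job failed', err));
```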
and here are the arguments to the failing lua code:
I'm not able to find a repeatable way to reproduce this error :(
We're using Redis 2.8.4 on a host a few ms of latency away, with a dozen or so processes across different hosts using the same limiter, which is almost always at max capacity (`50` jobs). The error happens several times a minute.

Thanks for any help with this!