
Clustering feedback #112

Closed
avermeil opened this issue Feb 5, 2019 · 9 comments
@avermeil

avermeil commented Feb 5, 2019

I sometimes get an error thrown in my limiter.on('error', err => console.log(err)) handler:

ReplyError: ERR Error running script (call to f_c8e3e22936b90c2d0e919964acaf643f234b6146): @user_script:77: user_script:77: attempt to perform arithmetic on a nil value

When this error happens, jobs currently EXECUTING seem to disappear, and the code using the limiter hangs forever.

This is my setup code:

const bottleneck_options = {
    id: 'awql_throttler',
    maxConcurrent: 50,
    minTime: 30,
    datastore: 'redis',
    clearDatastore: false,
    clientOptions: {
        host: redis_host,
        auth_pass: redis_password,
    },
    timeout: 1000 * 60,
}

const throttler = new Bottleneck(bottleneck_options)
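
For context, the error handler mentioned at the top of this issue isn't shown in the setup above. A minimal sketch of how it would be attached to this limiter (illustrative only, not part of the original post):

// Sketch only: errors raised inside Bottleneck's Redis scripts, like the
// ReplyError above, are emitted on the limiter's 'error' event.
throttler.on('error', err => console.log(err))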

This is how I add jobs:

        throttler.wrap(slowFunction).withOptions(
            {
                expiration: 1000 * 60 * 2, 
            },
            args
        )
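
The wrapped call returns a promise for slowFunction's result; here is a hedged sketch of consuming it (the async wrapper, try/catch, and logging are illustrative additions, not from the original post):

// Sketch only: awaiting the wrapped call and handling failures. A job that
// runs past its expiration is rejected with an error rather than resolving.
async function runJob(args) {
    try {
        const result = await throttler.wrap(slowFunction).withOptions(
            { expiration: 1000 * 60 * 2 },
            args
        )
        return result
    } catch (err) {
        console.log('job failed or expired:', err)
    }
}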

Here are the arguments to the failing Lua script:

command: 'EVALSHA',
  args:
   [ 'c8e3e22936b90c2d0e919964acaf643f234b6146',
     8,
     'b_awql_throttler_settings',
     'b_awql_throttler_job_weights',
     'b_awql_throttler_job_expirations',
     'b_awql_throttler_job_clients',
     'b_awql_throttler_client_running',
     'b_awql_throttler_client_num_queued',
     'b_awql_throttler_client_last_registered',
     'b_awql_throttler_client_last_seen',
     1549381312251,
     'np7jc1nq29s',
     '5f3qr3bs7ok' ],

I'm not able to find a repeatable way to reproduce this error :(

We're using Redis 2.8.4 on a host a few milliseconds of latency away, with a dozen or so processes across different hosts using the same limiter, which is almost always at max capacity (50 jobs). The error happens several times a minute.

Thanks for any help with this!

@SGrondin
Owner

SGrondin commented Feb 5, 2019

Hi, thanks for the excellent error report. This is definitely a bug in Bottleneck itself.

  • Which version of Bottleneck are you using?
  • Did it only start throwing this error in the last 48 hours?

This is usually a very easy fix; I just need answers to these 2 questions. Will fix ASAP.

@avermeil
Author

avermeil commented Feb 5, 2019

  • I'm using Bottleneck v2.16.0
  • I think it's been throwing this error for more than 48 hours, including with previous Bottleneck versions (at least v2.15)

@SGrondin
Owner

SGrondin commented Feb 5, 2019

Found it. It's a race condition that can only happen at relatively high load or in relatively large clusters. The race condition was already handled in one place, but not in a second one just a few lines later.

tonumber(weights[i]) should be (tonumber(weights[i]) or 0)

Basically, limiter 1 cleaned up an expired job while limiter 2 was still running it. When limiter 2's own timer expired, it went to mark the job as expired and tried to compute how much capacity was now available for new jobs, but it couldn't find the job's weight value (already deleted by limiter 1) and threw an exception.

I'll release v2.16.1 with this fix later today.

@SGrondin
Owner

SGrondin commented Feb 5, 2019

Please try v2.16.1 and let me know if the problem persists.

By the way, you should make sure your job expirations aren't longer than your cluster timeout value. I'll be adding a validation for that in the next release.
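
A hedged illustration of that constraint, reusing the names from the first post (the numbers are examples only, not recommendations):

// Illustrative only: keep every job's `expiration` shorter than the limiter's
// `timeout` (how long the limiter's state is kept in Redis without activity).
const limiter = new Bottleneck({
    id: 'awql_throttler',
    datastore: 'redis',
    timeout: 1000 * 60 * 5,    // 5 minutes
    // ...other options as in the original setup...
})

// 2-minute job expiration < 5-minute cluster timeout
limiter.wrap(slowFunction).withOptions({ expiration: 1000 * 60 * 2 }, args)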

Out of curiosity, how many servers do you have in that cluster? And other than this error, is it working as expected? Thanks 😄

@avermeil
Author

avermeil commented Feb 6, 2019

I've just done a release with this new version -- so far so good :)

I've also increased the cluster timeout value -- you're right that it doesn't make sense, although in practice our cluster is never idle for more than a minute anyway.

We've got one redis server. As for the processes using Bottleneck, we've got 4 servers, each running roughly 4 node processes. They're all constantly competing for a rate-limited API resource, which is why we're using Bottleneck.

I think it's working as expected! But I have to admit that it's quite difficult to test. Sometimes it feels like things gradually slow down over the course of days, and restarting the processes or clearing the b_* keys makes things faster again. But I'm not sure whether it's an issue with Bottleneck or some other moving part.

Thanks for the super quick fix, it's very much appreciated! 👍

SGrondin changed the title from "Error in Lua script using redis clustering" to "Clustering feedback" on Feb 6, 2019
@SGrondin
Owner

SGrondin commented Feb 6, 2019

This is very valuable feedback, thank you. It's hard to get that kind of Clustering feedback since most usage is in commercial/closed source software.

Next time it starts slowing down, could you please look into the following for me?

  1. Do you only have to restart the Node processes to make it faster again? Or only delete the b_* keys? Or is it still slow until you do both things?
  2. When the system is slow, is the CPU usage higher on the Node processes than when it's fast?
  3. When the system is slow, is the network usage (from Bottleneck) higher on your Redis server? When you've reset everything and it's fast again, is the network usage back to normal?

Tip: the servers will recreate the keys automatically if they're missing.

Important: when you delete the b_* keys, make sure your servers are not running, because they might be in the middle of running jobs and would corrupt the data.
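
One possible way to do that cleanup, sketched with the ioredis package (an assumption; the thread doesn't say which Redis client is used outside of Bottleneck) and reusing the redis_host / redis_password values from the original setup:

// Sketch only: delete this limiter's b_* keys while no Bottleneck clients are
// running. SCAN is used instead of KEYS to avoid blocking the Redis server.
const Redis = require('ioredis')
const redis = new Redis({ host: redis_host, password: redis_password })

async function clearLimiterKeys() {
    let cursor = '0'
    do {
        const [next, keys] = await redis.scan(
            cursor, 'MATCH', 'b_awql_throttler_*', 'COUNT', 100
        )
        cursor = next
        if (keys.length > 0) {
            await redis.del(...keys)
        }
    } while (cursor !== '0')
}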

Thank you in advance, this will help me figure out if it's a Bottleneck issue that I need to reproduce and fix, or if it's an issue in your system.

@SGrondin
Owner

The slowdown you're noticing might be due to clock drift between your servers. When things start slowing down, please take a look and make sure their clocks are all in agreement. I have plans to use the Redis server's clock instead, but that will require Redis 3.2 or higher. Redis 5 recently came out, and even 3.2 is no longer supported. However, since you and many other people are still on 2.8, I'm waiting a little longer before making this change.
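
A rough way to check for drift from each app server, sketched with the ioredis package and the Redis TIME command (available since Redis 2.6); this snippet is an illustration, not something from the thread:

// Sketch only: compare this host's clock to the Redis server's clock. Network
// round-trip time is included in the result, so only drifts noticeably larger
// than the latency to Redis are meaningful.
const Redis = require('ioredis')
const redis = new Redis({ host: redis_host, password: redis_password })

async function clockDriftMs() {
    const [seconds, microseconds] = await redis.time()
    const redisMs = Number(seconds) * 1000 + Math.round(Number(microseconds) / 1000)
    return Date.now() - redisMs
}

clockDriftMs().then(drift => console.log(`drift vs Redis clock: ${drift} ms`))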

I'm still interested to hear more about this slowdown next time it happens, answers to the questions in my previous post would greatly help.

@avermeil
Author

It's possible that clocks could be the culprit, although we use the standard Ubuntu NTP setup, which I think should keep things fairly synced up.

Besides a potential rewrite of the clock functionality, are there any other reasons why we should upgrade our Redis instance for Bottleneck? Today, we're on v2.8 only because that's what came installed by default...

As for the slowdown issue, it seems to have been very solid since your last fix.

  • I haven't cleared the b_* keys since (I had only started doing it because of the bug you fixed). I consider the clearing of keys in Redis a closed affair.
  • Slowdowns that are fixed by restarting the server could really be caused by any number of things, and honestly I don't think Bottleneck is one of the prime suspects. But the next time it happens (and I think it might be caused by Bottleneck), I'll give you a proper breakdown of what's going on, including CPU usage and Redis network usage.

@SGrondin
Owner

> Besides a potential rewrite of the clock functionality, are there any other reasons why we should upgrade our Redis instance for Bottleneck? Today, we're on v2.8 only because that's what came installed by default...

Not really; Redis is so reliable that there's little reason to upgrade. When I decide to improve cluster performance and accuracy by making this Time change, users on a version below 3.2 will get a helpful error message upon first connection.

> As for the slowdown issue, it seems to have been very solid since your last fix.

🎉 🎉 🎉

> • Slowdowns that are fixed by restarting the server could really be caused by any number of things, and honestly I don't think Bottleneck is one of the prime suspects. But the next time it happens (and I think it might be caused by Bottleneck), I'll give you a proper breakdown of what's going on, including CPU usage and Redis network usage.

Perfect. I'm going to close this issue, please open another one if needed.
