createCluster clients don't handle on('error') correctly #2721
Comments
I'm also observing this same behavior.
I am also seeing this same behavior. It seems to happen for us whenever we perform something that requires a rolling update to redis (kubernetes), for example a version upgrade or a config update. The application ends up spewing errors and never reconnects, even after the cluster has been healthy for a long time. The only solution I could come up with is (like OP) to restart the process (in my case, failing the health check so k8s restarts the process). This is certainly a non-ideal solution. The general steps to repro:
I can also confirm this is an issue. Our setup is almost identical to the original post. After doing any kind of rolling update to the kubernetes redis cluster, once the pods are restarted we get Socket closed unexpectedly on the clients. That by itself is fine, but as explained above, the retry strategy never kicks in and the cluster client ends up in a closed state. While restarting the affected service does reconnect properly, we have a temporary workaround in place that should hopefully result in less downtime. I have pasted a simplified example below.
This is a rather serious issue that should get more attention, as at the moment there is no real alternative to this library for connecting to a redis cluster in a node.js environment (we have had an even worse experience with redis cluster and ioredis when using kubernetes).
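The commenter's simplified example did not survive extraction. A minimal sketch of what such a recreate-on-error workaround could look like (all names here are illustrative assumptions, not the commenter's actual code; `makeClient` stands in for a factory that would call `createCluster` and connect):

```javascript
// Hypothetical recreate-on-error workaround: since the cluster client's
// retry strategy never kicks in after a rolling update, throw the broken
// client away and build a fresh one whenever it reports an error.
// `makeClient` is any function returning an EventEmitter-like client.
function watchAndRecreate(makeClient) {
  const state = { client: null };
  const build = () => {
    const client = makeClient();
    client.on('error', (err) => {
      // Real code should log `err`, debounce, and only recreate on
      // fatal errors; this sketch recreates unconditionally.
      build();
    });
    state.client = client;
  };
  build();
  return state; // state.client always points at the current client
}
```

In a real service this would be driven by something like `watchAndRecreate(() => createCluster(options))`, with the caller always going through `state.client`.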
I'm also observing this same behavior.
I found out that version
I am seeing the same behaviour in my cluster with 4.6.14. Downgrading to 4.5.1 is sadly not an option: it does not seem to support hostname endpoints via testing with
@leibale, can you or one of the active maintainers take a look here? We would like to avoid building a workaround. We use the library's client with near-identical code for clusters and single instances, and for single instances the reconnect works as expected, so any workaround will need some novel code separation. A workaround is also likely to clash with any future fixes in
Same issue. This PR fixes it: #2783
I can confirm #2783 fixes the issue for us.
Thanks for the quick release. Very much appreciated 👍
Broken again; the fix got removed. This PR fixes the issue again: #2870
Description
We use cluster-mode with redis for sharded pub-sub (we have 3 masters and 3 replicas in a kubernetes cluster).
We have the following args for the clients:
and then we create the client(s) like this:
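The OP's option and creation snippets were lost in extraction. A minimal sketch of what a node-redis v4 cluster setup of this shape typically looks like (hostname, port, and backoff numbers below are illustrative assumptions, not the OP's values):

```javascript
const { createCluster } = require('redis');

// Illustrative stand-ins only -- the real options did not survive
// extraction. A rediss:// URL enables TLS; reconnectStrategy returns
// the delay in milliseconds before the next connection attempt.
const options = {
  rootNodes: [
    { url: 'rediss://redis-cluster.example.svc.cluster.local:6379' }, // hypothetical host
  ],
  defaults: {
    socket: {
      reconnectStrategy: (retries) => Math.min(retries * 100, 5000),
    },
  },
};

async function main() {
  const client = createCluster(options);
  client.on('error', (err) => console.error('Redis Cluster Error', err));
  await client.connect();
}
```

The `defaults` block is applied to every per-node client the cluster client creates internally, which is why the reconnect behaviour described in this issue is expected to apply cluster-wide.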
Sometimes our redis pub-sub cluster goes down (e.g. for maintenance or a version upgrade, since we run it in kubernetes), and we'll receive the following error:
We correctly log the error by catching it in the error handler, but we never seem to retry / reconnect -- the only way I can get a reconnect to actually happen is to continually restart the process until the reconnection succeeds.
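For context, a reconnectStrategy in node-redis v4 is just a function from the retry count to the next delay in milliseconds (returning an Error instead aborts retrying). A strategy like the sketch below should therefore, in principle, never stop retrying; the specific numbers are arbitrary assumptions:

```javascript
// Capped exponential backoff: 100 ms, 200 ms, 400 ms, ..., then 5 s forever.
// Because it always returns a number (never an Error), it never gives up.
const reconnectStrategy = (retries) => Math.min(2 ** retries * 100, 5000);
```

It would be passed under `defaults.socket.reconnectStrategy` when creating the cluster client; the behaviour reported in this issue is that after certain socket failures the cluster client apparently never invokes it at all.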
Also, if the process tries to issue a command, we sometimes get an internal error that kills the process with a Node uncaught exception, even though I've added a client.on('error') handler as described above. I followed the findings from #2120 and #2302, but those don't really solve our problems.
What I'd like is the ability to specify a reconnect strategy so that we keep retrying (according to the reconnectStrategy) if we lose our TLS connection or fail to talk to a node in the cluster. I'd also like messages to be queued while we're offline, instead of an error being thrown and taking down the process.
Node.js Version
20.11.1
Redis Server Version
7.0.10
Node Redis Version
4.6.13
Platform
linux
Logs
No response