Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4 - requests against connections with unexpectedly closed sockets don't get fulfilled after reconnect #1789

Closed
dhensby opened this issue Dec 17, 2021 · 8 comments
Labels

Comments

@dhensby
Copy link

dhensby commented Dec 17, 2021

I've recently upgraded from v3 to v4 and noticed that we're seeing client.get() calls that are hanging. I've not managed to pin-point the exact area where the problem is originating (compiled typescript doesn't make this easy) but we are noticing this in conjunction with connections that are being closed (presumably by the sever) after inactivity.

Cursory investigations appear to point to connections where the socket has been closed by the server (after ~10 minutes of inactivity) which are then resulting in a re-connection attempts (which is ultimately successful) but the queued get request is never being fulfilled.

I've remedied this by adding a ping keep-alive (as suggested here, though implemented in a slightly less naive way) and the problem has gone away (we're no longer seeing closed connections and no longer seeing unfulfilled requests).

This was not a problem in v3, the connection may timeout but a reconnect was made and the get call fulfilled successfully.

Possibly related issues:

I'll spend a bit of time trying to reproduce this in a test, but ultimately I've papered over this issue by implementing a ping-based keep-alive for my client.

Other comments:

In v3 this situation (attempting to make a request when the underlying stream was closed) resulted in a reconnecting event and no error event. In v4 we are seeing no reconnecting event and we are seeing an error event. I'm not sure if this is an intended change, but it makes discerning genuine errors and errors that can be handled by the built in re-connection strategy harder.

Environment:

  • Node.js Version: v14.18.1
  • Redis Server Version: 4.0.14 (Azure Cache for Redis)
  • Node Redis Version: 4.0.1
  • Platform: Windows (Azure Function App, Node Runtime)
@dhensby dhensby added the Bug label Dec 17, 2021
@dhensby

This comment has been minimized.

@leibale
Copy link
Contributor

leibale commented Dec 19, 2021

@dhensby which version of npm are you using? because the project uses npm workspaces, which were introduced in npm v7

Regarding the issue itself - "When a socket closed unexpectedly, all the commands that were already sent will reject as they might have been executed on the server. The rest will remain queued in memory until a new socket is established. If the client is closed—either by returning an error from reconnectStrategy or by manually calling .disconnect()—they will be rejected." (https://github.com/redis/node-redis/blob/master/docs/FAQ.md#what-happens-when-the-network-goes-down)

@dhensby
Copy link
Author

dhensby commented Dec 19, 2021

@leibale - thanks for the pointer, I upgraded NPM and it has built the tools now. I thought I had done this (as I noticed the github workflow doing it) but clearly not...

Regarding the issue:

  1. What you cite is about the network going down (this issue isn't about the network going down but the socket closed unexpectedly situation which is the specific handling of network connections that have been closed by the server due to idle timeouts triggering on the server side).
  2. The requests neither resolve nor reject, they do not fulfil at all as per the issue description. The promise "hangs" indefinitely even though reconnection is established successfully by the retry mechanism.

If what you cite is intended behaviour, then it's a pretty big regression from v3 and the logic that the request shouldn't be sent to the server because it may have already been fulfilled doesn't make sense if the socket was already closed before the request was sent and especially makes no sense for read operations (read operations could not have been fulfilled by the server, at leat not as far as the client/application is concerned, if the connection is down).

@leibale
Copy link
Contributor

leibale commented Dec 20, 2021

My bad, I've missed the "hanging" part... 🤦‍♂️

This is not the intended behavior, and I was thinking that this test covers this, but I guess I'm missing something.. do you have a reproduction of that bug?

@dhensby
Copy link
Author

dhensby commented Dec 20, 2021

I'm not sure killClient is quite doing what we want to replicate the problem. It's a problem that happens on the underlying socket so if we want to truly replicate it we need the underlying socket to emit an error event (ECONNRESET aka SocketClosedUnexpectedlyError). Then it's down to the pipeline handling in the client (I'm guessing here) to establish a new connection and then send the queued commands though the new socket.

I've been battling a little to get the test suite running locally, but I can run it enough to try to create a test now. I just need a way to stub/force the socket to emit an error event as needed 🤔

@dhensby
Copy link
Author

dhensby commented Jan 3, 2022

Hi @cit68. @leibale has written a subsequent comment indicating that the quote does not apply here.

This is a bug and there are some changes in behaviour since v3, I'm not clear on if they are intentional.

The main changes for when the connection has been reset (disconnected due to server-side timeout) then:

  1. an error event is emitted (this didn't happen in v3 that I ever noticed).
  2. no reconnecting event is emitted on the first reconnection attempt (this used to happen in v3).

This issue isn't pertaining to those changes as they could well be intentional (though my personal preference would be that there was no error event and there was a reconnecting event emitted, because I don't want to have to filter "false" errors from the error listener).

The issue here is that the request that triggers the reconnection logic in these circumstances never resolves.

@cit68
Copy link

cit68 commented Jan 3, 2022

@leibale Sorry!!!
While replying to the comment I was removing mine

The reason is that I had put a wrong label in the "reconnect" handler so I thought it was not possible to have a successful pong and write / read command without any reconnection and created the previous comment.

But really there is an automatic reconnection before these commands, so my comment made no sense to exist.

Thanks for the reply.

But reading your point n°2
"no reconnecting event is emitted on the first reconnection attempt (this used to happen in v3)."
In my current test (v4.0.1) it's present reconnection attempt!

@dhensby
Copy link
Author

dhensby commented Jan 3, 2022

But reading your point n°2
"no reconnecting event is emitted on the first reconnection attempt (this used to happen in v3)."
In my current test (v4.0.1) it's present reconnection attempt!

Ah, ok - Maybe I'm mis-remembering. I haven't looked at this for a couple of weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants