Mplex stream termination issue #1814
You mention the error "IO Error: bytes remaining on stream", but that is not included in the logs. Could you share or point me to the code you are using for this request/response protocol?
Does that mean the problem did not occur after #1769 but only after #1784 and/or #1785?
Could you point me to where you see that in the logs?
Could you also point me to where you see data being dropped in the logs? The line you quote only says that a received frame was dropped for a substream that is no longer known, and the corresponding line on the "Receiver" side doesn't look overly suspicious at first sight. In particular, there are logs from the following 3 substreams in your excerpts, one of which was actually initiated by the "Receiver" node:
I'm not sure which of these you are referring to w.r.t. the problem description. The logs don't seem to contain a full request-response exchange for any of these. Furthermore, in the "Receiver" logs there are these lines interleaved:
These are prefix-dated Oct 24th, and I'm not sure what to make of them at the moment. Thanks for your report and your help in looking into this; I'm just not yet able to see the problem clearly. I'm still hoping for clearer logs and more insight into the code you're using to better diagnose the problem.
Hey @romanb, thanks for the quick reply. I've been meaning to get back to you earlier but wanted more substantial info to help diagnose this. I saw you sorted out sigp/lighthouse#1832, thanks! I thought that might be related, but it occurs on an earlier libp2p version than the issue here.
I wanted to spare you the RPC code and ideally have us do the leg-work to isolate it further. But the RPC code is here: https://github.com/sigp/lighthouse/tree/master/beacon_node/eth2_libp2p/src/rpc The important parts are establishing a stream: https://github.com/sigp/lighthouse/blob/master/beacon_node/eth2_libp2p/src/rpc/protocol.rs#L291 and the various stream management (which is somewhat complex): https://github.com/sigp/lighthouse/blob/master/beacon_node/eth2_libp2p/src/rpc/handler.rs#L628 (this is the inbound side, but the equivalent outbound is in there). With all the RPC logic aside, the fundamental error occurs from polling the substream: https://github.com/sigp/lighthouse/blob/master/beacon_node/eth2_libp2p/src/rpc/handler.rs#L808 This should give the IO error I quoted. The underlying substream is a framed timeout substream, but I don't think all of this complexity is too relevant. The logs I posted were given by @pawanjay176 (perhaps his system clock is out?). I've been meaning to set up an isolated, reproducible test that can give you the full logs (and that you can run as well) but am yet to have the time. @pawanjay176 however has done some further investigation and has the following thoughts:
I believe @pawanjay176 may have more to add when he is online next. As soon as I get the time, I'll also try to build some isolated tests for this. Perhaps this extra info may help a little in the meantime.
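For reference, here is a minimal sketch (not the lighthouse code) of how the quoted IO error can surface when polling a framed substream. The "bytes remaining on stream" message typically comes from a tokio-util `Framed` codec when the underlying stream reaches EOF (e.g. after a reset) while a partially received frame is still in the decode buffer. The codec choice and function below are assumptions for illustration only:

```rust
use futures::StreamExt;
use tokio_util::codec::{Framed, LengthDelimitedCodec};

// Poll frames off a substream until it ends. If the substream is reset or
// closed mid-frame, the codec yields an io::Error with the message
// "bytes remaining on stream".
async fn poll_responses<S>(substream: S) -> Result<(), std::io::Error>
where
    S: tokio::io::AsyncRead + tokio::io::AsyncWrite + Unpin,
{
    let mut framed = Framed::new(substream, LengthDelimitedCodec::new());
    while let Some(frame) = framed.next().await {
        let bytes = frame?; // surfaces the error on a truncated frame
        println!("received {} bytes", bytes.len());
    }
    Ok(())
}
```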
@romanb sorry for the late reply here. I'll try to describe the setup from which I got the above logs, on top of what @AgeManning has already described. We have 2 nodes running with a custom RPC behaviour plus the libp2p identify behaviour. Both nodes exchange an RPC metadata request, followed by Node1 sending a request to Node2. From the logs, it looks like with the newer version of libp2p, mplex on Node1 somehow closes the stream before receiving the response from Node2 and hence ends up dropping the entire response. Full logs for Node1 and Node2 with the older version of libp2p, when it was working as expected (commit: 8cec457). Below is a summary of the issue with snippets from the Node1 logs for the failing case.
My guess is that Node1's mplex closes the substream before the full response has been received.
Thanks for the further details.
There is an important misconception here, as these are not the same substreams. In Mplex, a substream ID is a tuple of the numeric stream ID and the role of the endpoint that opened the stream, so the same number can refer to two distinct substreams, one opened by each peer.
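As an illustration of that distinction, here is a hypothetical way to model a substream identity (illustrative only, not the actual libp2p-mplex internals):

```rust
// A substream is identified not by its number alone but by the number
// together with which endpoint opened it, so "stream 2" opened locally and
// "stream 2" opened by the remote are distinct substreams.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Opener {
    Local,
    Remote,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SubstreamId {
    num: u64,
    opener: Opener,
}
```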
So there is no substream mix-up going on here, as far as I can tell. But your logs are very helpful. In particular what they show is the following:
Note the 10 second delay of Node 2. At this point Node 1 has likely dropped the substream due to a timeout (better logs may be needed to make this more visible; 10 seconds is the default timeout for inbound/outbound substream upgrades), and hence when it finally receives the response, the substream is already gone and the data is dropped.
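To make that scenario concrete, here is a purely illustrative sketch (generic tokio code, not the libp2p upgrade machinery or the lighthouse handler) of a requester that gives up on a response after 10 seconds and drops the substream, so that a late response has nowhere to go:

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

// Wait up to 10 seconds for a response. If the timeout fires, the substream
// is dropped when it goes out of scope; anything the peer sends afterwards
// arrives for a substream the local node no longer knows and is discarded.
async fn await_response<S>(mut substream: S) -> std::io::Result<Option<Vec<u8>>>
where
    S: AsyncRead + Unpin,
{
    let mut buf = vec![0u8; 4096];
    match timeout(Duration::from_secs(10), substream.read(&mut buf)).await {
        Ok(Ok(n)) => Ok(Some(buf[..n].to_vec())),
        Ok(Err(e)) => Err(e),
        Err(_elapsed) => Ok(None), // timed out; substream dropped here
    }
}
```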
It's probably unrelated, but I noticed that the RPC connection handler does not seem to poll substreams while they are in the Busy state. Since I didn't find any obvious mistakes when re-reviewing the libp2p changes between the two mentioned versions, I'm at a point where I need to reproduce it for myself and experiment with the code, especially since between the two mentioned libp2p versions there is also #1775, which may trigger similar effects. So I would really appreciate short instructions on how to run a setup that reproduces the problem.
I've had a quick crack at building a reproducible example for you. However, running lighthouse on our testnet, I immediately see conflicts with older versions; it seems the update to mplex may not be backwards compatible. I think there's a lot of noise in running lighthouse on our testnet and it may not be the best way to debug this. @pawanjay176 is working on an isolated example that we can work off.
We have managed to progress a little here, but would appreciate your help in further pinpointing the issue @romanb. We have an isolated test you can run to reproduce the errors. Pull this repo: https://github.com/pawanjay176/libp2p_test.git. The test works by first running a receiver in one terminal:
Then, in another terminal, run the sender:
You can add arbitrary log events as needed. In these tests, we've noticed the error occurs for large RPC requests/responses. The basic test is for a sender to request a large number of responses (256). The receiver attempts to respond to the request by sending all of these responses. Looking at the logs, it seems the sender starts receiving the responses up until the frame buffer gets filled. It then seems to send a reset and drop the remaining data. I would have expected polling of the stream to pull the data off as it is received, such that we wouldn't hit this buffer limit. Also: thanks for pointing out the non-polling of the Busy() state. That has been corrected :) (but doesn't seem related to this)
To add to @AgeManning's comment above, these are the logs we get when running the test:
I believe we got the original error in lighthouse that we mentioned in the issue because we didn't get the complete data at the application level: the stream got reset and the rest of the data got dropped.
Some clarifications:
Thanks for all the new information.
When a task tries to read from a particular substream, the multiplexer eagerly reads all the frames from the underlying connection until it finds one for the substream in question (or the connection is closed); frames that belong to other substreams are buffered in the meantime, and that buffer has a maximum size. The options for this problem are the following, both configurable via the `MplexConfig`: increase the maximum buffer size, and/or change the behaviour when the buffer is full so that back-pressure is applied to the connection instead of resetting the affected substream.
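A sketch of these two knobs, assuming the post-#1784 `libp2p-mplex` configuration API (method and type names may differ slightly between versions; the values are illustrative):

```rust
use libp2p::mplex::{MaxBufferBehaviour, MplexConfig};

// Build an mplex configuration with a larger frame buffer and with
// back-pressure instead of substream resets when that buffer fills up.
fn mplex_config() -> MplexConfig {
    let mut config = MplexConfig::new();
    config.set_max_buffer_size(256);
    config.set_max_buffer_behaviour(MaxBufferBehaviour::Block);
    config
}
```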
However, the problem that has been described in earlier comments, and whose logs were analyzed in #1814 (comment), seems to be different, i.e. a scenario where the sender of the responses suffers from a delay long enough for the requester to give up on the substream.
Hi @romanb, that's been addressed by @AgeManning (he mentioned it in the last line of his last comment), but it doesn't look like it is related. I'll try running a LH node with the new libp2p and the suggested mplex configuration changes.
Thanks @romanb, it appears this was the issue. Increasing the buffer size and changing the max buffer behaviour has resolved it.
We recently updated to the latest master of rust-libp2p.
After the update, it seems mplex no longer functions as we expect. Without changing any of the application logic, we see the error:
This appears to be an mplex issue, as it doesn't occur when switching to yamux. From the commit history, the main change to mplex between our working commit and now is #1784, along with the less substantial #1785.
We have yet to pinpoint the exact issue, but can share the logs we currently have for our protocol.
The protocol we are using is an RPC protocol with a single short-lived stream per request/response. A requester negotiates and opens a stream, sends a request, and awaits a response over that stream.
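For context, a minimal sketch of that pattern over a generic substream (illustrative only, not the lighthouse RPC code; the function and framing are assumptions):

```rust
use futures::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// One short-lived stream per request/response: write the request, close the
// write side to signal the end of the request, then read the full response
// before dropping the stream.
async fn request_response<S>(mut substream: S, request: &[u8]) -> std::io::Result<Vec<u8>>
where
    S: AsyncRead + AsyncWrite + Unpin,
{
    substream.write_all(request).await?;
    substream.close().await?; // no more data will be written

    let mut response = Vec::new();
    substream.read_to_end(&mut response).await?;
    Ok(response)
}
```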
It appears the receiving side now closes the substream early, after receiving the request, and the requester then drops the data with "unknown substream".
Requester logs:
Receiver logs:
To give a comparison to a working version:
Requester:
Working receiver:
cc @pawanjay176 @romanb