This repository has been archived by the owner on Feb 1, 2023. It is now read-only.

Cancel/Want short-circuiting is racy #347

Closed
Stebalien opened this issue Apr 11, 2020 · 12 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@Stebalien
Member

Cancel Race:

  1. Enqueue want.
  2. Prepare message with want.
  3. Cancel want. This won't enqueue a cancel; it will just remove the want from the pending list.
  4. Send the message with the want.
  5. Remove the want from pending, add to sent.
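The steps above can be sketched with a toy queue (the type and field names are hypothetical, not the real go-bitswap API) that shows how the short-circuit in step 3 loses the cancel:

```go
package main

import "fmt"

// Toy model of the message queue. pending holds wants not yet sent, sent
// holds wants that went out on the wire, cancels holds cancels waiting to
// be sent.
type queue struct {
	pending map[string]bool
	sent    map[string]bool
	cancels []string
}

func newQueue() *queue {
	return &queue{pending: map[string]bool{}, sent: map[string]bool{}}
}

func (q *queue) addWant(c string) { q.pending[c] = true }

// addCancel short-circuits: a want still in pending is simply dropped
// instead of enqueueing a cancel -- the racy step 3 above.
func (q *queue) addCancel(c string) {
	if q.pending[c] {
		delete(q.pending, c)
		return
	}
	q.cancels = append(q.cancels, c)
}

func main() {
	q := newQueue()
	q.addWant("cid1")       // 1. enqueue want
	msg := []string{"cid1"} // 2. prepare message with the want
	q.addCancel("cid1")     // 3. cancel: only removes it from pending
	for _, c := range msg { // 4.-5. send, then move the want to sent
		q.sent[c] = true
	}
	// The want reached the peer, but no cancel is queued to undo it.
	fmt.Println(q.sent["cid1"], len(q.cancels)) // true 0
}
```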

However, the inverse isn't an issue:

  1. Enqueue cancel.
  2. Prepare message with cancel. This will remove the cancel from pending immediately.
  3. Re-add want.
  4. Send cancel.

So, I think the solution here is to either:

  1. Eagerly move wants/cancels from pending to sent and move them back if something goes wrong.
  2. Move wants from pending to sent after the fact, but enqueue a cancel if we notice that the want is no longer there.

IMO, the second solution is probably the "more correct" solution.
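The second option could be sketched like this (names are hypothetical): after the wire send, each want in the just-sent message is moved to sent, and if it has vanished from pending in the meantime, a cancel raced with the send and must be enqueued.

```go
package main

import "fmt"

// Sketch of option 2: reconcile pending/sent after the message goes out,
// recovering any cancel that raced with the send.
type mq struct {
	pending        map[string]bool
	sent           map[string]bool
	pendingCancels []string
}

func newMQ() *mq {
	return &mq{pending: map[string]bool{}, sent: map[string]bool{}}
}

func (q *mq) addWant(c string) { q.pending[c] = true }

func (q *mq) addCancel(c string) {
	if q.pending[c] {
		delete(q.pending, c) // short-circuit as before
		return
	}
	q.pendingCancels = append(q.pendingCancels, c)
}

// onMessageSent runs after the wire send; sentWants is the wantlist that
// actually went out.
func (q *mq) onMessageSent(sentWants []string) {
	for _, c := range sentWants {
		if !q.pending[c] {
			// Cancelled while the message was in flight: the peer saw
			// the want, so a cancel must follow.
			q.pendingCancels = append(q.pendingCancels, c)
		}
		delete(q.pending, c)
		q.sent[c] = true
	}
}

func main() {
	q := newMQ()
	q.addWant("cid1")
	msg := []string{"cid1"} // message prepared
	q.addCancel("cid1")     // cancel races with the send
	q.onMessageSent(msg)
	fmt.Println(q.pendingCancels) // [cid1]: the lost cancel is recovered
}
```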

@Stebalien Stebalien added the kind/bug A bug in existing code (including security flaws) label Apr 11, 2020
@Stebalien
Member Author

Really, if we fail to send, we should drop the peer and walk away. We can do this by:

  1. Stopping the queue.
  2. Marking the peer as "dead".

That is, we won't forget about the peer. Otherwise, we'll mess up our reference counting. Instead, we just mark it as dead and move on.

I'm pretty sure a large part of our issue has been peers that literally don't support the bitswap protocol. We'll currently keep retrying over and over.
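The stop-and-mark-dead idea could look roughly like this (a sketch with hypothetical names): on a send failure the queue flags the peer and stops, but the entry is kept so reference counting is unaffected, and a later "connected" event can revive it.

```go
package main

import (
	"errors"
	"fmt"
)

// errProtocol stands in for any send failure, e.g. a peer that doesn't
// speak the bitswap protocol.
var errProtocol = errors.New("protocol not supported")

type peerQueue struct {
	dead bool
	send func([]string) error
}

func (q *peerQueue) trySend(wants []string) {
	if q.dead {
		return // don't hammer peers that failed before
	}
	if err := q.send(wants); err != nil {
		q.dead = true // stop the queue; keep the entry for refcounting
	}
}

// onConnected revives the peer from scratch on a new "connected" event.
func (q *peerQueue) onConnected() { q.dead = false }

func main() {
	calls := 0
	q := &peerQueue{send: func([]string) error {
		calls++
		return errProtocol
	}}
	q.trySend([]string{"cid1"})
	q.trySend([]string{"cid2"}) // ignored: peer is marked dead
	fmt.Println(calls, q.dead)  // 1 true
}
```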

@dirkmc
Contributor

dirkmc commented Apr 13, 2020

You're right, nice catch 👍

An alternative that occurs to me is that when we call MessageQueue.AddCancels([]cid.Cid), we can check if there is a corresponding want in the message that is currently being sent.
BitSwapMessage stores the wantlist as a map, so we could add a method BitSwapMessage.FilterWantsFor(cancelCids).
For each cancel:

  • if it's pending
    • if it's currently being sent (it's in the message wantlist): enqueue cancel
    • otherwise remove it from pending
  • otherwise enqueue cancel
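That decision tree could be sketched as a small pure function (the names here, like the `inFlight` wantlist standing in for the message currently being sent, are hypothetical):

```go
package main

import "fmt"

// decideCancel returns what to do with one cancel, given the set of
// pending wants and the wantlist of the message currently being sent.
func decideCancel(c string, pending, inFlight map[string]bool) string {
	if pending[c] {
		if inFlight[c] {
			return "enqueue-cancel" // the want is going out right now
		}
		return "drop-pending-want" // safe to short-circuit
	}
	return "enqueue-cancel" // the want was already sent earlier
}

func main() {
	pending := map[string]bool{"a": true, "b": true}
	inFlight := map[string]bool{"a": true}
	fmt.Println(decideCancel("a", pending, inFlight)) // enqueue-cancel
	fmt.Println(decideCancel("b", pending, inFlight)) // drop-pending-want
	fmt.Println(decideCancel("c", pending, inFlight)) // enqueue-cancel
}
```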

@Stebalien
Member Author

I think we should be disconnecting from these peers regardless. I'd also like to reduce the number of ad-hoc checks.

@Stebalien
Member Author

Sorry, not disconnecting. We should start ignoring these peers.

@dirkmc
Contributor

dirkmc commented Apr 13, 2020

Ah so you're saying we should just eagerly move wants/cancels from pending to sent, and if something goes wrong when trying to send the message we just shut down the message queue for the peer?

@Stebalien
Member Author

Yeah, basically. We could also try reviving the peer (from scratch) when we receive a new "connected" event, but that's a lower priority. It's just that the current logic is really unfriendly to non-IPFS peers, as we'll repeatedly hammer them with new streams even if they don't speak the protocol.

@dirkmc
Contributor

dirkmc commented Apr 13, 2020

Makes sense 👍

Once they've failed a few dial attempts, do you think we should try to reconnect with a backoff, or just put them in a blacklist forever?

@Stebalien
Member Author

For now, I just want to drop them (and re-activate when we receive a new "connect" event).

@dirkmc
Contributor

dirkmc commented Apr 14, 2020

If the peer doesn't respond for a long time, I think we actually need to simulate it having "disconnected" so that it's removed from the session. This allows the session to stop sending want-block to the peer and to search for better peers.

The session will re-add the peer if it receives a block or HAVE from the peer, so it may make sense to wire this change through at the networking layer. I propose:

  • Move detection of multiple Connected / Disconnected events into the networking layer
  • Move retry logic into the networking layer
  • If we fail to open a stream to a peer, put the peer into the Unresponsive state and fire the Disconnected event
  • When we receive a message from a peer, if it's in the Unresponsive state
    • Move it to the Ok state
    • Fire a Connected event
    • Pass the message to Bitswap
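The proposed transitions could be sketched as a small state machine in the networking layer (the state names and the shape of the event plumbing are hypothetical):

```go
package main

import "fmt"

// peerState tracks responsiveness in the networking layer.
type peerState int

const (
	stateOk peerState = iota
	stateUnresponsive
)

type netLayer struct {
	states map[string]peerState
	events []string // fired Connected/Disconnected events, for illustration
}

func newNetLayer() *netLayer {
	return &netLayer{states: map[string]peerState{}}
}

// onStreamError: opening a stream to the peer failed. Only a real
// transition fires an event, so repeated failures collapse into one
// Disconnected notification.
func (n *netLayer) onStreamError(p string) {
	if n.states[p] != stateUnresponsive {
		n.states[p] = stateUnresponsive
		n.events = append(n.events, "Disconnected:"+p)
	}
}

// onMessage: a message arrived from the peer; an Unresponsive peer moves
// back to Ok and a Connected event is fired before the message is passed
// on to Bitswap.
func (n *netLayer) onMessage(p string) {
	if n.states[p] == stateUnresponsive {
		n.states[p] = stateOk
		n.events = append(n.events, "Connected:"+p)
	}
	// ...then pass the message to Bitswap.
}

func main() {
	n := newNetLayer()
	n.onStreamError("peerA")
	n.onStreamError("peerA") // duplicate failure: no second event
	n.onMessage("peerA")     // peer came back: fire Connected
	fmt.Println(n.events)    // [Disconnected:peerA Connected:peerA]
}
```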

@Stebalien
Member Author

That sounds like a great idea.

@dirkmc
Contributor

dirkmc commented Apr 20, 2020

@Stebalien is this issue resolved? If so, I'll close it.

@Stebalien
Member Author

👍
