This repository has been archived by the owner on Feb 1, 2023. It is now read-only.

Cancel/Want short-circuiting is racy #347

Closed
Stebalien opened this issue Apr 11, 2020 · 12 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@Stebalien
Member

Cancel Race:

  1. Enqueue want.
  2. Prepare message with want.
  3. Cancel want. This won't enqueue a cancel; it will just remove the want from the pending list.
  4. Send the message with the want.
  5. Remove the want from pending, add to sent.
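The steps above can be sketched with a toy queue (the type and field names are hypothetical, not the real go-bitswap API) that shows how the short-circuit in step 3 loses the cancel:

```go
package main

import "fmt"

// Toy model of the message queue. pending holds wants not yet sent, sent
// holds wants that went out on the wire, cancels holds cancels waiting to
// be sent.
type queue struct {
	pending map[string]bool
	sent    map[string]bool
	cancels []string
}

func newQueue() *queue {
	return &queue{pending: map[string]bool{}, sent: map[string]bool{}}
}

func (q *queue) addWant(c string) { q.pending[c] = true }

// addCancel short-circuits: a want still in pending is simply dropped
// instead of enqueueing a cancel -- the racy step 3 above.
func (q *queue) addCancel(c string) {
	if q.pending[c] {
		delete(q.pending, c)
		return
	}
	q.cancels = append(q.cancels, c)
}

func main() {
	q := newQueue()
	q.addWant("cid1")       // 1. enqueue want
	msg := []string{"cid1"} // 2. prepare message with the want
	q.addCancel("cid1")     // 3. cancel: only removes it from pending
	for _, c := range msg { // 4.-5. send, then move the want to sent
		q.sent[c] = true
	}
	// The want reached the peer, but no cancel is queued to undo it.
	fmt.Println(q.sent["cid1"], len(q.cancels)) // true 0
}
```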

However, the inverse isn't an issue:

  1. Enqueue cancel.
  2. Prepare message with cancel. This will remove the cancel from pending immediately.
  3. Re-add want.
  4. Send cancel.

So, I think the solution here is to either:

  1. Eagerly move wants/cancels from pending to sent and move them back if something goes wrong.
  2. Move wants from pending to sent after the fact, but enqueue a cancel if we notice that the want is no longer there.

IMO, the second solution is probably the "more correct" solution.
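The second option could be sketched like this (names are hypothetical): after the wire send, each want in the just-sent message is moved to sent, and if it has vanished from pending in the meantime, a cancel raced with the send and must be enqueued.

```go
package main

import "fmt"

// Sketch of option 2: reconcile pending/sent after the message goes out,
// recovering any cancel that raced with the send.
type mq struct {
	pending        map[string]bool
	sent           map[string]bool
	pendingCancels []string
}

func newMQ() *mq {
	return &mq{pending: map[string]bool{}, sent: map[string]bool{}}
}

func (q *mq) addWant(c string) { q.pending[c] = true }

func (q *mq) addCancel(c string) {
	if q.pending[c] {
		delete(q.pending, c) // short-circuit as before
		return
	}
	q.pendingCancels = append(q.pendingCancels, c)
}

// onMessageSent runs after the wire send; sentWants is the wantlist that
// actually went out.
func (q *mq) onMessageSent(sentWants []string) {
	for _, c := range sentWants {
		if !q.pending[c] {
			// Cancelled while the message was in flight: the peer saw
			// the want, so a cancel must follow.
			q.pendingCancels = append(q.pendingCancels, c)
		}
		delete(q.pending, c)
		q.sent[c] = true
	}
}

func main() {
	q := newMQ()
	q.addWant("cid1")
	msg := []string{"cid1"} // message prepared
	q.addCancel("cid1")     // cancel races with the send
	q.onMessageSent(msg)
	fmt.Println(q.pendingCancels) // [cid1]: the lost cancel is recovered
}
```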

@Stebalien Stebalien added the kind/bug A bug in existing code (including security flaws) label Apr 11, 2020
@Stebalien
Member Author

Really, if we fail to send, we should drop the peer and walk away. We can do this by:

  1. Stopping the queue.
  2. Marking the peer as "dead".

That is, we won't forget about the peer. Otherwise, we'll mess up our reference counting. Instead, we just mark it as dead and move on.

I'm pretty sure a large part of our issue has been peers that literally don't support the bitswap protocol. We'll currently keep retrying over and over.
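The stop-and-mark-dead idea could look roughly like this (a sketch with hypothetical names): on a send failure the queue flags the peer and stops, but the entry is kept so reference counting is unaffected, and a later "connected" event can revive it.

```go
package main

import (
	"errors"
	"fmt"
)

// errProtocol stands in for any send failure, e.g. a peer that doesn't
// speak the bitswap protocol.
var errProtocol = errors.New("protocol not supported")

type peerQueue struct {
	dead bool
	send func([]string) error
}

func (q *peerQueue) trySend(wants []string) {
	if q.dead {
		return // don't hammer peers that failed before
	}
	if err := q.send(wants); err != nil {
		q.dead = true // stop the queue; keep the entry for refcounting
	}
}

// onConnected revives the peer from scratch on a new "connected" event.
func (q *peerQueue) onConnected() { q.dead = false }

func main() {
	calls := 0
	q := &peerQueue{send: func([]string) error {
		calls++
		return errProtocol
	}}
	q.trySend([]string{"cid1"})
	q.trySend([]string{"cid2"}) // ignored: peer is marked dead
	fmt.Println(calls, q.dead)  // 1 true
}
```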

@dirkmc
Contributor

dirkmc commented Apr 13, 2020

You're right, nice catch 👍

An alternative that occurs to me is that when we call MessageQueue.AddCancels([]cid.Cid), we can check if there is a corresponding want in the message that is currently being sent.
BitSwapMessage stores the wantlist as a map, so we could add a method BitSwapMessage.FilterWantsFor(cancelCids).
For each cancel:

  • if it's pending
    • if it's currently being sent (it's in the message wantlist): enqueue cancel
    • otherwise remove it from pending
  • otherwise enqueue cancel
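That decision tree could be sketched as a small pure function (the names here, like the `inFlight` wantlist standing in for the message currently being sent, are hypothetical):

```go
package main

import "fmt"

// decideCancel returns what to do with one cancel, given the set of
// pending wants and the wantlist of the message currently being sent.
func decideCancel(c string, pending, inFlight map[string]bool) string {
	if pending[c] {
		if inFlight[c] {
			return "enqueue-cancel" // the want is going out right now
		}
		return "drop-pending-want" // safe to short-circuit
	}
	return "enqueue-cancel" // the want was already sent earlier
}

func main() {
	pending := map[string]bool{"a": true, "b": true}
	inFlight := map[string]bool{"a": true}
	fmt.Println(decideCancel("a", pending, inFlight)) // enqueue-cancel
	fmt.Println(decideCancel("b", pending, inFlight)) // drop-pending-want
	fmt.Println(decideCancel("c", pending, inFlight)) // enqueue-cancel
}
```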

@Stebalien
Member Author

I think we should be disconnecting from these peers regardless. I'd also like to reduce the number of ad-hoc checks.

@Stebalien
Member Author

Sorry, not disconnecting. We should start ignoring these peers.

@dirkmc
Contributor

dirkmc commented Apr 13, 2020

Ah so you're saying we should just eagerly move wants/cancels from pending to sent, and if something goes wrong when trying to send the message we just shut down the message queue for the peer?

@Stebalien
Member Author

Yeah, basically. We could also try reviving the peer (from scratch) when we receive a new "connected" event, but that's a lower priority. It's just that the current logic is really unfriendly to non-IPFS peers, as we'll repeatedly hammer them with new streams even if they don't speak the protocol.

@dirkmc
Contributor

dirkmc commented Apr 13, 2020

Makes sense 👍

Once they've failed a few dial attempts, do you think we should try to reconnect with a backoff, or just put them in a blacklist forever?

@Stebalien
Member Author

For now, I just want to drop them (and re-activate when we receive a new "connect" event).

@dirkmc
Contributor

dirkmc commented Apr 14, 2020

If the peer doesn't respond for a long time, I think we actually need to simulate it having "disconnected" so that it's removed from the session. This allows the session to stop sending want-block to the peer and to search for better peers.

The session will re-add the peer if it receives a block or HAVE from the peer, so it may make sense to wire this change through at the networking layer. I propose:

  • Move detection of multiple Connected / Disconnected events into the networking layer
  • Move retry logic into the networking layer
  • If we fail to open a stream to a peer, put the peer into the Unresponsive state and fire the Disconnected event
  • When we receive a message from a peer, if it's in the Unresponsive state
    • Move it to the Ok state
    • Fire a Connected event
    • Pass the message to Bitswap
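The proposed transitions could be sketched as a small state machine in the networking layer (the state names and the shape of the event plumbing are hypothetical):

```go
package main

import "fmt"

// peerState tracks responsiveness in the networking layer.
type peerState int

const (
	stateOk peerState = iota
	stateUnresponsive
)

type netLayer struct {
	states map[string]peerState
	events []string // fired Connected/Disconnected events, for illustration
}

func newNetLayer() *netLayer {
	return &netLayer{states: map[string]peerState{}}
}

// onStreamError: opening a stream to the peer failed. Only a real
// transition fires an event, so repeated failures collapse into one
// Disconnected notification.
func (n *netLayer) onStreamError(p string) {
	if n.states[p] != stateUnresponsive {
		n.states[p] = stateUnresponsive
		n.events = append(n.events, "Disconnected:"+p)
	}
}

// onMessage: a message arrived from the peer; an Unresponsive peer moves
// back to Ok and a Connected event is fired before the message is passed
// on to Bitswap.
func (n *netLayer) onMessage(p string) {
	if n.states[p] == stateUnresponsive {
		n.states[p] = stateOk
		n.events = append(n.events, "Connected:"+p)
	}
	// ...then pass the message to Bitswap.
}

func main() {
	n := newNetLayer()
	n.onStreamError("peerA")
	n.onStreamError("peerA") // duplicate failure: no second event
	n.onMessage("peerA")     // peer came back: fire Connected
	fmt.Println(n.events)    // [Disconnected:peerA Connected:peerA]
}
```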

@Stebalien
Member Author

That sounds like a great idea.

@dirkmc
Contributor

dirkmc commented Apr 20, 2020

@Stebalien is this issue resolved? If so, I'll close it.

@Stebalien
Member Author

👍
