-
Notifications
You must be signed in to change notification settings - Fork 20.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
p2p/discover: bootnodes not retried often enough when table becomes empty #20630
Comments
This is definitely known, but it's also expected behavior. Discovery cannot work without bootstrapping and the lookup iterator will block until the table contains some nodes to do the lookup with. We have a rewrite of the dialer coming up in #20592, but that PR mostly improves the latency of acting on discovery results. If there are no discovery results, no peer connections can be created. |
Maybe I am misunderstanding your issue. AFAIK the lookupIterator will try to access the bootnodes repeatedly if they do not work. Please provide log output of |
Thanks for the response. I see Here are the logs you asked for when running I think I see what you're saying about go-ethereum/p2p/discover/lookup.go Line 194 in 2c37142
go-ethereum/p2p/discover/v4_udp.go Line 418 in 2c37142
I haven't taken time to look why that wasn't working, but it feels to me like it wasn't working as I expect it should because I saw an infinite loop between go-ethereum/p2p/discover/lookup.go Line 193 in 2c37142
go-ethereum/p2p/discover/lookup.go Line 197 in 2c37142
I can try to look more into what the issue is there when I have some time |
Thank you for the logs. It looks like the part that's broken is that |
@fjl I was looking at this and thought that it should have been fixed by #21396 (though it doesn't reinsert the bootnoodes), since that change means the bootnode won't be dropped there if it's the only node in the table (which is what happens in this scenario). But I found that it didn't fix the issue, and after looking into it I saw (from the logs saying go-ethereum/p2p/discover/table.go Lines 348 to 354 in fb2c79d
I figured it would make sense to change this, to go along with the change in #21396, and I could submit a PR for it, but I wasn't sure what the exact thing to do is, whether it's to avoid removing it (a) if it's the only one in the bucket and there are no replacements, (b) if the bucket is less than half-full as in #21396 and there are no replacements, or (c) something else. So I thought I'd ask to see what your thoughts on it were before proceeding. Thanks! |
I have been noticing a related bug which might have similar root cause to this one but might not.
Expected behavior: Is that the same issue as this, or should it be spun off as a separate one? |
Any thoughts on pulling in oneeman's fix or response to the question above? From my perspective (not a geth maintainer) option (b) seems to make the most sense - I'd want the bootnodes specified at startup to remain in the list and continue to be retried even if they are occasionally offline, but we don't want the table filled up completely with dead bootnodes while other live nodes are available on the network (unless |
System information
Geth version: a fork of
1.9.8-stable
OS & Version: OSX
Commit hash :
d62e9b285777c036c108b89fac0c78f7855ba314
Expected behaviour
A node should repeatedly try using a bootnode for discovery if the bootnode is not accessible at node startup
Actual behaviour
When the bootnode is unreachable at the start of the node, one
discoverTask
is created to the bootnode and never again. Looking at #20573, I'm not sure if this is already a known issue, but this is closely related.Detailed breakdown:
When starting a
discoveryTask
,lookupRunning
is set to true in this if statement:go-ethereum/p2p/dial.go
Line 184 in 976a0f5
So no other discoverTasks can be run at the same time. However, the discoverTask never finishes.
The task begins in
Do()
:go-ethereum/p2p/dial.go
Line 325 in 976a0f5
which calls ReadNodes():
go-ethereum/p2p/enode/iter.go
Line 36 in 976a0f5
ReadNodes uses the
FairMix
iterator, and calls Next():go-ethereum/p2p/enode/iter.go
Line 203 in 976a0f5
Next will block for a value from
source.next
here:go-ethereum/p2p/enode/iter.go
Line 218 in 976a0f5
Which is given by a separate goroutine
runSource
:go-ethereum/p2p/enode/iter.go
Line 276 in 976a0f5
runSource
uses another iterator, which is thelookupIterator
in this case. I found that thelookupIterator
Next()
will result in an infinite loop that will caserunSource
to hang, and subsequently the channelsource.next
to never receive a value.The infinite loop happens here
go-ethereum/p2p/discover/lookup.go
Line 209 in 976a0f5
This causes
FairMix
'sNext()
to time out, which is expected-- the only bad part is that the timeout callsm.nextFromAny()
which reads fromm.fromAny
here:go-ethereum/p2p/enode/iter.go
Line 241 in 976a0f5
runSource
go-ethereum/p2p/enode/iter.go
Line 283 in 976a0f5
Next()
call.Steps to reproduce the behaviour
Start a node with
--bootnode
pointing to a bootnode that is inaccessible, then make the bootnode accessible & the node will not send anydiscoverTask
s to the bootnode.The text was updated successfully, but these errors were encountered: