
(Staging) Discovery not working properly #1265

Closed · skylenet opened this issue Feb 22, 2019 · 13 comments

@skylenet (Contributor)

skylenet commented Feb 22, 2019

This is currently happening on our swarm-private deployment where we have:

  • 1 swarm bootnode
  • 30 swarm nodes

Logs:

$ kubectl -n staging logs -f swarm-private-1
DEBUG[02-22|14:56:45.294|p2p/server.go:733]                  Adding p2p peer                          name=swarm/v0.3.12-d9adcd... addr=10.0.107.93:30301                          peers=1
DEBUG[02-22|14:56:45.295|swarm/network/protocol.go:200]      peer created                             addr="3d8e8be887a075648e69f37e299b8cb3182d58249518e39eb04198bf6e166001 <enode://846c424961adc146d54861bdf1eb6015e6908b689fd12d01c61307fffc848c22e514f5c898dc9243fbb17aa80750b556772599d84fe86a4b715f40ebc4c049bf@10.0.107.93:30301?discport=0>"
DEBUG[02-22|14:56:48.835|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:56:49.347|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:56:50.339|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:56:50.851|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:56:52.359|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:56:58.467|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:57:01.784|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:57:02.339|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
DEBUG[02-22|14:57:03.784|p2p/dial.go:317]                    Can't resolve node                       id=0x97e090         err="discovery is disabled"
...

admin.peers:

$ kubectl -n staging exec -it swarm-private-1 -- ./geth attach /root/.ethereum/bzzd.ipc --exec admin.peers.length
1

bzz.hive (the population line shows only 1 of the 33 known peers connected):

$ kubectl exec -n staging -ti swarm-private-1 -- /geth --exec="console.log(bzz.hive)" attach /root/.ethereum/bzzd.ipc

=========================================================================
commit hash: d9adcd3a27cb14042bdb230f073487192683390c
Fri Feb 22 15:00:02 UTC 2019 KΛÐΞMLIΛ hive: queen's address: 1becab
population: 1 (33), NeighbourhoodSize: 2, MinBinSize: 2, MaxBinSize: 4
============ DEPTH: 0 ==========================================
000  0                              | 12 849a (7) bf38 (7) a934 (7) a5b9 (7)
001  0                              | 10 7381 (7) 71b1 (7) 676b (7) 6b9b (7)
002  1 3d8e                         |  6 28b4 (7) 2d98 (7) 2c65 (7) 2fdc (7)
003  0                              |  3 0bc2 (7) 0b5d (7) 013b (7)
004  0                              |  1 1628 (7)
005  0                              |  0
006  0                              |  1 1919 (7)
007  0                              |  0
008  0                              |  0
009  0                              |  0
010  0                              |  0
011  0                              |  0
012  0                              |  0
013  0                              |  0
014  0                              |  0
015  0                              |  0
=========================================================================
@skylenet skylenet added the bug label Feb 22, 2019
@zelig (Member)

zelig commented Feb 23, 2019

@nolash @skylenet please check if this is the result of ethereum/go-ethereum#19137.
Maybe on the cluster this won't be correct.

@nolash (Contributor)

nolash commented Feb 23, 2019

@skylenet the peer is not available on the IP used on the incoming interface?

@skylenet skylenet self-assigned this Feb 25, 2019
@skylenet (Contributor, Author)

skylenet commented Feb 25, 2019

> @nolash @skylenet please check if this is the result of ethereum#19137.
> Maybe on the cluster this won't be correct.

Maybe... but in that case I find it weird that we have such a peer distribution. I would expect more of an "it works on all or none" scenario.

[image: chart of peer counts across the cluster's nodes]

> @skylenet the peer is not available on the IP used on the incoming interface?

It is available.

@skylenet (Contributor, Author)

skylenet commented Feb 25, 2019

Running the logs at verbose level, I see a lot of:

TRACE[02-25|11:06:14.396|p2p/dial.go:299]                                 Dial error                               task="staticdial 184adfc2cd7a5cd4 10.0.25.201:30399"  err="dial tcp 10.0.25.201:30399: i/o timeout"
TRACE[02-25|11:06:14.396|p2p/server.go:704]                               Dial task done                           task="staticdial 184adfc2cd7a5cd4 10.0.25.201:30399"
TRACE[02-25|11:06:19.099|p2p/dial.go:299]                                 Dial error                               task="staticdial 8d0e00675953a829 10.0.28.227:30399"  err="dial tcp 10.0.28.227:30399: i/o timeout"
TRACE[02-25|11:06:19.099|p2p/server.go:704]                               Dial task done                           task="staticdial 8d0e00675953a829 10.0.28.227:30399"
TRACE[02-25|11:06:19.099|p2p/dial.go:299]                                 Dial error                               task="staticdial de7d07ef981ca030 10.0.5.198:30399"   err="dial tcp 10.0.5.198:30399: i/o timeout"
TRACE[02-25|11:06:19.099|p2p/server.go:704]                               Dial task done                           task="staticdial de7d07ef981ca030 10.0.5.198:30399"
TRACE[02-25|11:06:19.099|p2p/dial.go:299]                                 Dial error                               task="staticdial a02634f8e64eeebe 10.0.24.197:30399"  err="dial tcp 10.0.24.197:30399: i/o timeout"
TRACE[02-25|11:06:19.100|p2p/server.go:704]                               Dial task done                           task="staticdial a02634f8e64eeebe 10.0.24.197:30399"
TRACE[02-25|11:06:34.099|p2p/dial.go:299]                                 Dial error                               task="staticdial 152072eb977937ec 10.0.49.77:30399"   err="dial tcp 10.0.49.77:30399: i/o timeout"
TRACE[02-25|11:06:34.099|p2p/server.go:704]                               Dial task done                           task="staticdial 152072eb977937ec 10.0.49.77:30399"
TRACE[02-25|11:06:37.155|p2p/dial.go:299]                                 Dial error                               task="staticdial f6dacc59c9f84323 10.0.115.1:30399"   err="dial tcp 10.0.115.1:30399: connect: no route to host"
TRACE[02-25|11:06:37.156|p2p/server.go:704]                               Dial task done                           task="staticdial f6dacc59c9f84323 10.0.115.1:30399"
TRACE[02-25|11:06:37.155|p2p/dial.go:299]                                 Dial error                               task="staticdial a3c272bfbfc95bd3 10.0.109.183:30399" err="dial tcp 10.0.109.183:30399: connect: no route to host"
TRACE[02-25|11:06:37.156|p2p/server.go:704]                               Dial task done                           task="staticdial a3c272bfbfc95bd3 10.0.109.183:30399"
TRACE[02-25|11:06:40.227|p2p/dial.go:299]                                 Dial error                               task="staticdial ab8fab0858d9683c 10.0.127.26:30399"  err="dial tcp 10.0.127.26:30399: connect: no route to host"
TRACE[02-25|11:06:40.228|p2p/server.go:704]                               Dial task done                           task="staticdial ab8fab0858d9683c 10.0.127.26:30399"
TRACE[02-25|11:06:40.227|p2p/dial.go:299]                                 Dial error                               task="staticdial babdbc768aca81f0 10.0.86.157:30399"  err="dial tcp 10.0.86.157:30399: connect: no route to host"
TRACE[02-25|11:06:40.228|p2p/server.go:704]                               Dial task done                           task="staticdial babdbc768aca81f0 10.0.86.157:30399"
TRACE[02-25|11:06:43.299|p2p/dial.go:299]                                 Dial error                               task="staticdial 404679ae00e5130b 10.0.72.229:30399"  err="dial tcp 10.0.72.229:30399: connect: no route to host"
TRACE[02-25|11:06:43.299|p2p/server.go:704]                               Dial task done                           task="staticdial 404679ae00e5130b 10.0.72.229:30399"
TRACE[02-25|11:06:44.451|p2p/dial.go:299]                                 Dial error                               task="staticdial bc3372da326e61c2 10.0.81.219:30399"  err="dial tcp 10.0.81.219:30399: connect: no route to host"
TRACE[02-25|11:06:44.452|p2p/server.go:704]                               Dial task done                           task="staticdial bc3372da326e61c2 10.0.81.219:30399"
TRACE[02-25|11:06:46.339|p2p/dial.go:299]                                 Dial error                               task="staticdial f10e590ccce5038b 10.0.82.209:30301"  err="dial tcp 10.0.82.209:30301: connect: no route to host"
TRACE[02-25|11:06:46.340|p2p/server.go:704]                               Dial task done                           task="staticdial f10e590ccce5038b 10.0.82.209:30301"
TRACE[02-25|11:06:47.523|p2p/dial.go:299]                                 Dial error                               task="staticdial 0d84ba7a43ff9371 10.0.83.245:30399"  err="dial tcp 10.0.83.245:30399: connect: no route to host"
TRACE[02-25|11:06:47.524|p2p/server.go:704]                               Dial task done                           task="staticdial 0d84ba7a43ff9371 10.0.83.245:30399"
TRACE[02-25|11:06:49.100|p2p/dial.go:299]                                 Dial error                               task="staticdial 8568ce91c3ab62bf 10.0.81.105:30399"  err="dial tcp 10.0.81.105:30399: i/o timeout"
TRACE[02-25|11:06:49.100|p2p/server.go:704]                               Dial task done                           task="staticdial 8568ce91c3ab62bf 10.0.81.105:30399"
TRACE[02-25|11:06:49.100|p2p/dial.go:299]                                 Dial error                               task="staticdial fb20889b94303cbf 10.0.14.160:30399"  err="dial tcp 10.0.14.160:30399: i/o timeout"
TRACE[02-25|11:06:49.101|p2p/server.go:704]                               Dial task done                           task="staticdial fb20889b94303cbf 10.0.14.160:30399"
TRACE[02-25|11:06:49.101|p2p/dial.go:299]                                 Dial error                               task="staticdial 72f704d4e3197d91 10.0.56.101:30399"  err="dial tcp 10.0.56.101:30399: i/o timeout"
TRACE[02-25|11:06:49.101|p2p/server.go:704]                               Dial task done                           task="staticdial 72f704d4e3197d91 10.0.56.101:30399"
TRACE[02-25|11:06:49.100|p2p/dial.go:299]                                 Dial error                               task="staticdial 9e8e8138023e49bb 10.0.72.75:30399"   err="dial tcp 10.0.72.75:30399: i/o timeout"
TRACE[02-25|11:06:49.101|p2p/server.go:704]                               Dial task done                           task="staticdial 9e8e8138023e49bb 10.0.72.75:30399"

These IP addresses are not assigned to any running containers. They could be addresses that existed in the past.
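
A quick way to confirm these are dead addresses rather than just slow peers is to dial them directly from inside the cluster. A minimal Go sketch (target addresses copied from the log above):

// probe.go: reproduce the two dial failure modes from the TRACE log.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	targets := []string{
		"10.0.25.201:30399", // "i/o timeout" in the log
		"10.0.115.1:30399",  // "no route to host" in the log
	}
	for _, addr := range targets {
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			fmt.Printf("%s: %v\n", addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: reachable\n", addr)
	}
}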

@skylenet (Contributor, Author)

skylenet commented Feb 25, 2019

We can see that swarm-1 tries to connect to an enode using IP 10.0.72.102:

TRACE[02-25|11:16:36.172|swarm/network/kademlia.go:550]                   1becab9f: 28b4d663db89e4e94dd17150bcbea59b9a1d5f7bcbb85aaca5ecdd875844750e <enode://0d79c2c33bc1e9b88db929a6b97c929f0dd6bedee6801b9730cb9d79d4b0772be4b3ee7acea9a27a5ac264a67a97e4f6c193ac474d9de2e70bdbb270936a0b48@10.0.72.102:30399?discport=0> long time since last try (at 8388079579064) needed before retry 12, wait only warrants 11 

However, that node (swarm-private-9) is actually running with a different IP address, 10.0.73.80:

$ kubectl -n staging exec -it swarm-private-9 -c swarm -- /geth attach /root/.ethereum/bzzd.ipc --exec admin.nodeInfo.enode
"enode://0d79c2c33bc1e9b88db929a6b97c929f0dd6bedee6801b9730cb9d79d4b0772be4b3ee7acea9a27a5ac264a67a97e4f6c193ac474d9de2e70bdbb270936a0b48@10.0.73.80:30399?discport=0"

@nolash (Contributor)

nolash commented Feb 25, 2019

> These could be addresses that existed in the past.

@skylenet what does that mean, in the past? When you start a new cluster, none of the previous data is retained, right?

I also see that, for example, the 0d84ba7a43ff9371 node key doesn't exist in the diagram you posted, so it's not only the IP.

@skylenet (Contributor, Author)

skylenet commented Feb 25, 2019

> These could be addresses that existed in the past.
>
> @skylenet what does that mean, in the past? When you start a new cluster, none of the previous data is retained, right?

No. On our staging cluster we're actually persisting all the data. Nodes will change their IP addresses when redeployed.

@nolash (Contributor)

nolash commented Feb 25, 2019

Oh. Perhaps you should flush the nodes database, even if you retain the chunks...

@zelig (Member)

zelig commented Feb 25, 2019

Well, if IPs change and nodes are persisted, no wonder.
Deleting the nodes db is not a good idea in general, because it will then lose any external connections (of which there may be none).

In fact, IPs should be updated on existing records of the address book, so it's best to keep the persisted node db.

I created #1268

@skylenet (Contributor, Author)

skylenet commented Feb 25, 2019

@zelig what happens when my node has --maxpeers=10 and I've got 10 broken peers (because the IP changed) in my local database? Won't I receive any new peers from now on?

@nolash (Contributor)

nolash commented Feb 25, 2019

> Deleting the nodes db is not a good idea in general, because it will then lose any external connections (of which there may be none).

On the private deployment this shouldn't be an issue? So for private we can delete and get a quick and dirty verification of whether the issue persists.

@skylenet (Contributor, Author)

skylenet commented Feb 25, 2019

Check this node (swarm-private-26) that has only 1 peer:

$ kubectl -n staging exec -it swarm-private-26 -- ./geth attach /root/.ethereum/bzzd.ipc --exec admin.peers
[{
    caps: ["bzz/8", "hive/8", "pss/2", "stream/8"],
    enode: "enode://466216c9aad97c91ba0ad430763e247bfced07ad1fae3b962f7dc6f20f06c2261d80f050f06dca9023028d698f05d984587fbf8c24f9f6239bdbdde5d2983124@10.0.6.238:30399?discport=0",
    id: "5d8517599412374b94c41a30a4c8e8524eb048e269f6e159a43d9d4724b56696",
    name: "swarm/v0.3.12-2b573e21/linux-amd64/go1.11.5",
    network: {
      inbound: false,
      localAddress: "10.0.48.226:58162",
      remoteAddress: "10.0.6.238:30399",
      static: true,
      trusted: false
    },
    protocols: {
      bzz: "unknown",
      hive: {
        OAddr: "0x5d8517599412374b94c41a30a4c8e8524eb048e269f6e159a43d9d4724b56696",
        UAddr: "0x656e6f64653a2f2f34363632313663396161643937633931626130616434333037363365323437626663656430376164316661653362393632663764633666323066303663323236316438306630353066303664636139303233303238643639386630356439383435383766626638633234663966363233396264626464653564323938333132344031302e302e362e3233383a33303339393f64697363706f72743d30"
      },
      pss: "unknown",
      stream: "unknown"
    }
}]
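
The UAddr in the admin.peers output above is nothing mysterious: it is the peer's enode URL, hex-encoded. A minimal Go sketch to decode it:

// decode_uaddr.go: print the enode URL hidden inside a hive UAddr value.
// Usage: go run decode_uaddr.go 0x656e6f64653a2f2f...
package main

import (
	"encoding/hex"
	"fmt"
	"os"
	"strings"
)

func main() {
	raw, err := hex.DecodeString(strings.TrimPrefix(os.Args[1], "0x"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "not hex:", err)
		os.Exit(1)
	}
	// For the UAddr above this prints:
	// enode://466216c9...@10.0.6.238:30399?discport=0
	fmt.Println(string(raw))
}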

Here's a dump from the state-store, key peers:

$ level-dump /tmp/swarm-private-26/bzz-b895a049bacd679c0e9511eafef670fe731e0bfd/state-store.db -v -k false | grep OAddr | jq -r '.[].UAddr' | base64 -d | sed 's/enode/\n&/g' | sort

enode://0604bd38b193942266f28ca78c8e157b5c2ec9c87728988955faea56265c484b5b79b667f1a5302a8dc99104ca1adff9b71f5b99af998c755213f4900146deb7@10.0.72.75:30399?discport=0
enode://0d79c2c33bc1e9b88db929a6b97c929f0dd6bedee6801b9730cb9d79d4b0772be4b3ee7acea9a27a5ac264a67a97e4f6c193ac474d9de2e70bdbb270936a0b48@10.0.72.102:30399?discport=0
enode://137bbe524a1304e3c5165d52b3d5b20fe4929a53b594e77238abe877a80b31d181a738a3e984273e33f97314b970e9033230b34dd63df4a086c4d2d2fe169100@10.0.25.201:30399?discport=0
enode://1b2c418b1835e10d65f10e934cbadb4884e182cfb4145439587bc952c3923d60d879a0bcf74a1b48f0449d86b0148280fce06ad8520af72babedadb480b34adb@10.0.115.1:30399?discport=0
enode://1f1d26d05f2a05d4e59a910c2e383a3b344854b4fb1530620ba082a746995744102e07349697b6b57a7d03f0e2695c9132d4d3878f672d23915bd553e2b7dad1@10.0.13.202:30399?discport=0
enode://41b08e41a540fd0b53d73376f89120270c9e4eb2b6cc068df5952f701adde4f858e104b63fca1ea289ee77293b9323c32fcc2f5a0ae2ad8caf7fcb031a869782@10.0.51.109:30399?discport=0
enode://41cc0660d467d934e881d3c9016c3a1e3930459b84ae4aca1a91bd67693500dbc763058c49217afb08e9bd1188b32b1d50317e5e623e66a98e688e944d4f301f@10.0.86.157:30399?discport=0
enode://44bbe1d24f870a0b33e518aed492e0994858238542b040a2fd28e133ac8a8dff96f641afcf890fd8e7b883ec5f9c6c3392009951446140156fb64a6cbea7b766@10.0.58.84:30399?discport=0
enode://46393f33cd9088787b55861b1ac34bb4ac8395f5d8740a2560a8d336dd1769829eb87abc3899ce921176e288648a95766f8c6d7bee52deb5de16ba359af667cd@10.0.81.105:30399?discport=0
enode://4c7d27b5d0fdbde19536d588c23710216eee233748e7ff4a9306cfaa97956190c487ec62629a2f4892cac5143d684de94b7e95b409f9febf171c647495f81b3a@10.0.115.220:30399?discport=0
enode://6126be8a1dfcd83a511291d216201422e67eb82e2d3a43325d2b80790ad8bee67b4752eef2c06debaebf9d7edfb850b1ea3672d1e926c241facc5c0ca75a938a@10.0.28.227:30399?discport=0
enode://7872befa36c0f025f72cbdb745cf8439eab21967f82d51c150db20256c1d3c6625bed5e4a1ecdef089e3af5195e3f3c0ec1d3e6d9c975cfeb5ce1f2cdf741ea3@10.0.106.97:30399?discport=0
enode://846c424961adc146d54861bdf1eb6015e6908b689fd12d01c61307fffc848c22e514f5c898dc9243fbb17aa80750b556772599d84fe86a4b715f40ebc4c049bf@10.0.25.10:30301?discport=0
enode://846c424961adc146d54861bdf1eb6015e6908b689fd12d01c61307fffc848c22e514f5c898dc9243fbb17aa80750b556772599d84fe86a4b715f40ebc4c049bf@10.0.50.42:30301?discport=0
enode://846c424961adc146d54861bdf1eb6015e6908b689fd12d01c61307fffc848c22e514f5c898dc9243fbb17aa80750b556772599d84fe86a4b715f40ebc4c049bf@10.0.62.10:30301?discport=0
enode://846c424961adc146d54861bdf1eb6015e6908b689fd12d01c61307fffc848c22e514f5c898dc9243fbb17aa80750b556772599d84fe86a4b715f40ebc4c049bf@10.0.82.209:30301?discport=0
enode://878db01c1595a183a574db06bc7b93d8257c293d89350a2ed8443ced04d8adb6bb8b084271432373eaac38ff00184f18ec5586dce3e834631db7271fb5092daf@10.0.10.14:30399?discport=0
enode://9a71e8af5a8b49bbd63223c6bf7eb9755e1791174549f5bf54fe487c9366d20a9bf04fe34bd745a44c097b4ab2a71dbea60cf90471b3ff0ecb27007dc81844e5@10.0.54.119:30399?discport=0
enode://9e3fdf1e8daeac7125a8d053fd0740fa5a6260b5cd409a8a3f0e1063f29bd3ed9829f6195b1967097d3c677219a054d9f686000b9ff4077b817480a99c56462b@10.0.63.10:30399?discport=0
enode://a5e8bc2052241bf1409a5ae228f886790dcbba87471be8dfcd2a738f2988d84a5a2dfc0ea7a694d5ba50425b8ed7d9f76c5ce98784060a65b06b237d4a109945@10.0.33.66:30399?discport=0
enode://ae697baedbecd21cd70d3100211ef4d5c135743ac099e51a82bc487bd401c36fed48bddd81b9b20dc9f4861ecc6ba6227efcb4f3083cf781de047c412ce8e742@10.0.56.101:30399?discport=0
enode://af7fd36034d3cb2179a5bd1d96990b5de349bba5ed501cb08e2c2299f89089e23df8b054df2e984cb27c79198ee890144322d5809ec177d5d5d95d3a86869aef@10.0.65.66:30399?discport=0
enode://b142c44a2a72da984f48a4c96b6c9afa818b7c80f6de628263cb2d3a84a18d70ab6b8120be2fbbf4726b1e92872b233d878d98bb6f74ca47cce6c5bff535d1cc@10.0.17.96:30399?discport=0
enode://b30b4ebcf6b87b1827811edaea5855fc05b716b7f4d6800f30e68ce3b9fc19d2e0d41be70aee98f6bc051890bee8539f9c0cd449745c0585867e7e5c5787ca24@10.0.120.187:30399?discport=0
enode://b5b15b1e8f82da4d5ebe3fe3e8f9e0b50bb1c3268732b79f92af76121a83ed2f2b755cb03d870fb78e76ed56ee23719c40a89f747303e93bd23ca4826f05781d@10.0.44.245:30399?discport=0
enode://b68216bc3aaf6d62b487108035ac2d6a3bfaf96d0651d51249c0cbb8cc84732441457bf57e00a2d51e6e9869b3173be9fabede58e235b31b327f20567d754759@10.0.81.219:30399?discport=0
enode://c0cfe6292885c84db22f6a34baab432b4a585ed46a937233fd6b155981362bc278c4193be26785c6272fea21debd39d12b8d2d56dc29bda0d98a0aac6ce09101@10.0.3.99:30399?discport=0
enode://d51b2a010c19cba086067eddd6ba6eb711bb9b8c93ec4f399da8feb7527f572884dcd671eb41e5839f4addefa1c2439ff46013ae1045befdef2723695eaf3943@10.0.109.183:30399?discport=0
enode://dd98eafb2ee0558eb3b7ffa8ef37e700c0e48e8ad8e4fc46172b056df88370f7807005b0e2b149e498a9422809015bd091b284b56c737f7a421e0d09a8cfc5d8@10.0.112.74:30399?discport=0
enode://de7de5716157822971def6aa1fe17aa9255ac9f67d97cb0e3e3006d8dfd96b2eafb97f121495b55d1f69bdfbc9a97d2aad1e13bb17c08ac174a10f622744dc29@10.0.96.188:30399?discport=0
enode://ef134db4a1c7bcf279ceffd3aa7bcd9deb9b7db392463c38dd044b4ac00204ffa11326be501b30607ab035f58d37c988cc96ef0770a931a3f93cf2b14458ec8b@10.0.126.7:30399?discport=0
enode://f8ef1942657f3fd863f8072588088326e01d5255c2c4defe89c0d93f78827c59fa95123186d9c6341aa2790471943c280869b78724a744ffdf23a86017d8c79c@10.0.72.243:30399?discport=0
enode://faeb44aff96ad64e3a123bdf662d9c05b415dca6a6000b9ad0271ca81f2d3c4a27b46c8048fd9af60c2c780c0e6a8715d440bf6ad3f17f5ef66b16f30c0535e4@10.0.49.77:30399?discport=0

You can see that 846c4..., which is the bootnode, appears multiple times with different IP addresses. Also note that none of these IP addresses are assigned anymore.

So when the Hive starts and calls loadPeers(), it will load a bunch of broken peers.
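
Not the actual fix, but the shape of the problem in code: several persisted records share one node ID, and at most one IP per ID can still be valid. A minimal sketch that collapses a record list to one entry per node ID, keeping the last record seen (whether later records are actually fresher is an assumption here, and the real remedy belongs in the hive's peer persistence):

package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/p2p/enode"
)

// dedupeByID keeps only the last record seen for each node key, so a
// bootnode persisted under four historical IPs collapses to one entry.
// Assumption: later records are fresher than earlier ones.
func dedupeByID(urls []string) []*enode.Node {
	latest := make(map[enode.ID]*enode.Node)
	var order []enode.ID
	for _, raw := range urls {
		n, err := enode.ParseV4(raw)
		if err != nil {
			continue // skip records that no longer parse
		}
		if _, seen := latest[n.ID()]; !seen {
			order = append(order, n.ID())
		}
		latest[n.ID()] = n
	}
	out := make([]*enode.Node, 0, len(order))
	for _, id := range order {
		out = append(out, latest[id])
	}
	return out
}

func main() {
	// Two of the bootnode's historical records from the dump above.
	urls := []string{
		"enode://846c424961adc146d54861bdf1eb6015e6908b689fd12d01c61307fffc848c22e514f5c898dc9243fbb17aa80750b556772599d84fe86a4b715f40ebc4c049bf@10.0.25.10:30301?discport=0",
		"enode://846c424961adc146d54861bdf1eb6015e6908b689fd12d01c61307fffc848c22e514f5c898dc9243fbb17aa80750b556772599d84fe86a4b715f40ebc4c049bf@10.0.82.209:30301?discport=0",
	}
	for _, n := range dedupeByID(urls) {
		fmt.Println(n.ID(), "->", n.IP()) // one line: the 846c... key with IP 10.0.82.209
	}
}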

@skylenet
Copy link
Contributor Author

skylenet commented Mar 4, 2019

Fix: ethereum/go-ethereum#19189

@skylenet skylenet closed this as completed Mar 4, 2019