No peers on fullnodes & proxies randomly #1101

Closed
daithi-coombes opened this issue Jul 8, 2020 · 9 comments

@daithi-coombes

Expected Behavior

Peer discovery would work and nodes would sync.

Current Behavior

Randomly, full nodes and peers will not sync when first created. They have been left running for over 24 hrs. Also, manually injecting peers (admin.addPeer) will increase the peer count, but p2p discovery doesn't seem to be happening.
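
For context, the manual peer injection mentioned above looks roughly like this (a sketch only; the container name and enode URL are placeholders, and it assumes the node exposes the default IPC endpoint inside the container):

$ docker exec -it celo-fullnode geth attach --exec \
    'admin.addPeer("enode://<pubkey>@<peer-ip>:30303")'
$ docker exec -it celo-fullnode geth attach --exec 'net.peerCount'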

3 out of 4 times the nodes will sync.

It appears almost as if it's doing a port scan of target peers:

$ docker start celo-proxy && docker logs -f --tail 100 celo-proxy |& grep "Accepted connection"

TRACE[07-08|15:47:27.152] Accepted connection                      addr=34.73.132.106:40908
TRACE[07-08|15:48:02.158] Accepted connection                      addr=34.73.132.106:40914
TRACE[07-08|15:48:37.163] Accepted connection                      addr=34.73.132.106:40918
TRACE[07-08|15:49:12.168] Accepted connection                      addr=34.73.132.106:40926
TRACE[07-08|15:49:47.175] Accepted connection                      addr=34.73.132.106:40928
TRACE[07-08|15:50:22.179] Accepted connection                      addr=34.73.132.106:40936
TRACE[07-08|15:50:57.184] Accepted connection                      addr=34.73.132.106:40940
TRACE[07-08|15:51:32.190] Accepted connection                      addr=34.73.132.106:40946

(all above connections were rejected)

Steps to Reproduce Behavior

Unknown at the moment.

Logs

The following can be seen in the logs (taken from a peer node):

TRACE[07-08|15:41:54.945] New dial task                            task="dyndial 88dbf37a6be7f92f 50.17.60.161:30303"                                                                                                         
WARN [07-08|15:41:59.789] Stats login failed                       err="login timed out"                                                                                                                                      
TRACE[07-08|15:42:04.945] Skipping dial candidate                  id=88dbf37a6be7f92f addr=50.17.60.161:30303   err="already dialing"                                                                                        
TRACE[07-08|15:42:09.945] Dial error                               task="dyndial 88dbf37a6be7f92f 50.17.60.161:30303" err="dial tcp 50.17.60.161:30303: i/o timeout"                                                          
TRACE[07-08|15:42:09.945] Dial task done                           task="dyndial 88dbf37a6be7f92f 50.17.60.161:30303"                 
TRACE[07-08|15:42:09.946] Skipping dial candidate                  id=88dbf37a6be7f92f addr=50.17.60.161:30303   err="recently dialed"
ERROR[07-08|15:42:10.087] SendDelegateSignMsgToProxiedValidator failed err="No Proxied Validator found"                                                                                                                       
TRACE[07-08|15:42:12.103] Accepted connection                      addr=34.73.132.106:40862            
TRACE[07-08|15:42:12.105] Failed RLPx handshake                    addr=34.73.132.106:40862  conn=inbound err="ecies: invalid message"                                                                                        
TRACE[07-08|15:42:12.105] Setting up connection failed             addr=34.73.132.106:40862  err="ecies: invalid message"             
TRACE[07-08|15:42:13.958] Skipping dial candidate                  id=88dbf37a6be7f92f addr=50.17.60.161:30303   err="recently dialed"
TRACE[07-08|15:42:14.945] Skipping dial candidate                  id=88dbf37a6be7f92f addr=50.17.60.161:30303   err="recently dialed"                                                                                        
WARN [07-08|15:42:15.087] Stats login failed                       err="login timed out"                                                                                                                                      
TRACE[07-08|15:42:24.945] Skipping dial candidate                  id=88dbf37a6be7f92f addr=50.17.60.161:30303   err="recently dialed"

System information

Celo
Version: 1.0.0-stable
Architecture: amd64
Protocol Versions: [65 64]
Go Version: go1.13.10
Operating System: linux
GOPATH=
GOROOT=/usr/local/go

@tkporter
Contributor

tkporter commented Jul 8, 2020

Just to confirm, is your port 30303 (or whichever port you are listening on your proxy) open for TCP & UDP?

@kevjue
Contributor

kevjue commented Jul 8, 2020

@daithi-coombes - As I mentioned on Discord, it would help if you posted more trace-level logs that span at least a few hours.

@daithi-coombes
Author

Here are some logs:
celo-mainnet-bd-1-proxy-kr-json.log

@daithi-coombes
Author

Also, @tkporter, the ports are open:

$ nc -zv 34.73.13.52 30303
found 0 associations
found 1 connections:
     1: flags=82<CONNECTED,PREFERRED>
        outif en0
        src 10.119.22.156 port 57533
        dst 34.73.13.52 port 30303
        rank info not available
        TCP aux info available
Connection to 34.73.13.52 port 30303 [tcp/*] succeeded!
$ nc -zvu 34.73.13.52 30303
found 0 associations
found 1 connections:
     1: flags=82<CONNECTED,PREFERRED>
        outif (null)
        src 10.119.22.156 port 54055
        dst 34.73.13.52 port 30303
        rank info not available
Connection to 34.73.13.52 port 30303 [udp/*] succeeded!

@daithi-coombes
Author

One thing we noticed was that the peers that work have the IP address in the enode URL when running admin.nodeInfo, while the failing nodes had 127.0.0.1 in the enode URL.

So I added -nat extip:$external_ip thinking that would solve it, but unfortunately not.
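
For reference, this is roughly the shape of the change and the check afterwards (a sketch; the variable and the rest of the command line stand in for the existing launch flags):

$ geth --nat extip:$EXTERNAL_IP <existing flags>
$ geth attach --exec 'admin.nodeInfo.enode'

The second command shows which IP the node is currently advertising in its enode URL.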

@daithi-coombes
Author

Still having issues, here's what we've tried:

  1. Build bootnode from Celo's repo. Can't build or use anything because the go.mod still says ethereum/go-ethereum; some modules need to be celo-org/celo-blockchain (putting a rewrite in go.mod got me halfway there; see the sketch after this list). Built from the master branch, not sure if this is the correct branch to use?

  2. After hours of building and ld going bonkers, decided screw that, just use a synced proxy's enode as the bootnode. This worked until today, when both the foundation's bootnode and our bootnode were full.
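
For reference, the go.mod rewrite mentioned in (1) was along these lines (a sketch only; the commit or tag to pin is a placeholder and needs to match whichever branch is actually being built):

$ go mod edit -replace github.com/ethereum/go-ethereum=github.com/celo-org/celo-blockchain@<commit-or-tag>
$ go mod tidy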

Next, going to try adding the enodes of our currently synced proxies to static-nodes.json or trusted-nodes.json (I think static keeps retrying even if disconnected, and trusted will get peers but won't show up in peerCount).
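
Both files are just a JSON array of enode URLs dropped into the node's data directory (assuming the fork keeps upstream go-ethereum's layout); a sketch with placeholder URLs:

[
  "enode://<proxy1-pubkey>@203.0.113.10:30303",
  "enode://<proxy2-pubkey>@203.0.113.11:30303"
]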

@kevjue
Contributor

kevjue commented Jul 14, 2020

Thanks for providing this info, @daithi-coombes. We just went through our two-month planning session, and we have set aside dedicated time to investigate and fix these discovery issues.

  1. Build bootnode from Celo's repo. Can't build or use anything because the go.mod still says ethereum/go-ethereum; some modules need to be celo-org/celo-blockchain (putting a rewrite in go.mod got me halfway there). Built from the master branch, not sure if this is the correct branch to use?

Are you doing your tests on the mainnet network? If so, please build against our v1.0.0 version: https://github.com/celo-org/celo-blockchain/tree/celo-v1.0.0

2. After hours of building and ld going bonkers, decided screw that, just use a synced proxy's enode as the bootnode. This worked until today, when both the foundation's bootnode and our bootnode were full.

I suspect that your proxy ran out of eth-level connections. However, your proxies should be sharing information about the peers they know of with your full node, and that is one thing we are planning to investigate.

Next, going to try adding the enodes of our currently synced proxies to static-nodes.json or trusted-nodes.json (I think static keeps retrying even if disconnected, and trusted will get peers but won't show up in peerCount).

Yes, you are correct. Static connections will keep retrying. If you really want your full node to connect to your proxy, then you will need to add your full node as a trusted connection on your proxy. That way, your full node will not be counted against the peer count on your proxy.
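
A sketch of doing this at runtime on the proxy via the standard admin API (container name and enode URL are placeholders):

$ docker exec -it celo-proxy geth attach --exec \
    'admin.addTrustedPeer("enode://<fullnode-pubkey>@<fullnode-ip>:30303")'

Listing the full node's enode in the proxy's trusted-nodes.json should have the same effect persistently, assuming the fork keeps upstream go-ethereum's behavior.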

@oneeman
Contributor

oneeman commented Dec 15, 2020

@daithi-coombes Looking at the above, what caught my attention were the following lines from the peer node:

TRACE[07-08|15:42:12.103] Accepted connection                      addr=34.73.132.106:40862            
TRACE[07-08|15:42:12.105] Failed RLPx handshake                    addr=34.73.132.106:40862  conn=inbound err="ecies: invalid message"                                                                                        
TRACE[07-08|15:42:12.105] Setting up connection failed             addr=34.73.132.106:40862  err="ecies: invalid message"    

Did this error also appear in the other cases? If so, the problem is not with peer discovery or connection, but in the handshake, in one of the two Decrypt() calls in this function:

celo-blockchain/p2p/rlpx.go, lines 469 to 498 in 19f51eb:

func readHandshakeMsg(msg plainDecoder, plainSize int, prv *ecdsa.PrivateKey, r io.Reader) ([]byte, error) {
	buf := make([]byte, plainSize)
	if _, err := io.ReadFull(r, buf); err != nil {
		return buf, err
	}
	// Attempt decoding pre-EIP-8 "plain" format.
	key := ecies.ImportECDSA(prv)
	if dec, err := key.Decrypt(buf, nil, nil); err == nil {
		msg.decodePlain(dec)
		return buf, nil
	}
	// Could be EIP-8 format, try that.
	prefix := buf[:2]
	size := binary.BigEndian.Uint16(prefix)
	if size < uint16(plainSize) {
		return buf, fmt.Errorf("size underflow, need at least %d bytes", plainSize)
	}
	buf = append(buf, make([]byte, size-uint16(plainSize)+2)...)
	if _, err := io.ReadFull(r, buf[plainSize:]); err != nil {
		return buf, err
	}
	dec, err := key.Decrypt(buf[2:], nil, prefix)
	if err != nil {
		return buf, err
	}
	// Can't use rlp.DecodeBytes here because it rejects
	// trailing data (forward-compatibility).
	s := rlp.NewStream(bytes.NewReader(dec), 0)
	return buf, s.Decode(msg)
}

Have you encountered this problem since?

@oneeman
Contributor

oneeman commented Jan 15, 2021

Closing this due to lack of information for further investigation. Will reopen if we get more information in the future.

oneeman closed this as completed Jan 15, 2021