Ignore non unique route updates #2266
Conversation
Looking forward to this fix so that we can re-enable the HA feature in our setup! I am not very familiar with Go or the NetBird source code, so I can't give proper feedback on whether the proposed solution is OK and covers all cases, but at first glance it seems reasonable. Maybe adding a test case for the problem this solution fixes would ensure that this issue doesn't come back as a regression? Alternatively, rather than fixing this in the handleUpdate function as proposed in this PR, the following algorithm could be re-evaluated: getBestRouteFromStatuses, where there is already:
That seems like a solution that does not fix this as of now, since in the logs from the linked issue the routes are changing constantly. Maybe some rounding should be added to the score calculation, along with a condition to not switch between routes whose scores are only marginally different (as in the logs in the linked issue). |
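To make the "marginally different scores" idea concrete, here is a minimal Go sketch of a switching margin (hysteresis). It is not NetBird's getBestRouteFromStatuses: the `candidate` type, the `pickRoute` function, and the `margin` parameter are all hypothetical names made up for illustration, and higher scores are assumed to be better.

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for a routing peer and its score.
// Higher score is assumed to be better.
type candidate struct {
	ID    string
	Score float64
}

// pickRoute returns the ID of the route to use. It keeps the current route
// unless some other candidate beats it by more than margin.
func pickRoute(current string, cands []candidate, margin float64) string {
	bestID, bestScore := "", 0.0
	currentScore, haveCurrent := 0.0, false
	for i, c := range cands {
		if i == 0 || c.Score > bestScore {
			bestID, bestScore = c.ID, c.Score
		}
		if c.ID == current {
			currentScore, haveCurrent = c.Score, true
		}
	}
	// No current route, or it is no longer offered: take the best candidate.
	if !haveCurrent {
		return bestID
	}
	// Treat marginal score differences as noise and stay put.
	if bestScore-currentScore <= margin {
		return current
	}
	return bestID
}

func main() {
	cands := []candidate{{"peer-a", 1.02}, {"peer-b", 1.05}}
	fmt.Println(pickRoute("peer-a", cands, 0.1))  // difference within margin: stays on peer-a
	fmt.Println(pickRoute("peer-a", cands, 0.01)) // difference exceeds margin: switches to peer-b
}
```

A margin like this only reduces switching between near-equal peers; it does not help when the scores genuinely diverge.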
I have tried playing with the threshold, and in all honesty it decreases the number of times you hop from peer to peer, but it is not really a fix. I need to add a backoff routine to the statewatcher so that it avoids firing when connected before there is a latency value. |
Perhaps another addition would be to make the threshold before it jumps to another routing peer configurable, but any number of network conditions can affect latency negatively and force the clients to rebalance. |
@LeszekBlazewski what do you mean by the routes constantly changing? Is the routing peer having new network routes added via the API as new networks are provisioned in AWS? |
@hurricanehrndz No, the logs in the linked issue describe the following scenario: There are 3 sets of HA routing peers (1 set for each Kubernetes cluster). Each set consists of 2 NetBird clients running in Kubernetes pods as Docker containers, spread across 2 availability zones (so each NetBird peer runs on a different EC2 instance in a different availability zone). Each set of NetBird peers is responsible for routing traffic inside the VPC in which the Kubernetes cluster runs. This essentially means that each network route I have configured is HA, where the group that handles the traffic is one of the NetBird client sets (so I have 3 HA routes in total). Additionally, we temporarily have an exit node, which is one of the routing sets. This can be observed in the screenshot in the comment here: #2150 (comment) The above sets of NetBird pods in the 3 different EKS clusters are not reprovisioned and nothing is changed. From time to time, when I was trying to connect to NetBird (from my macOS client), I observed a scenario where my macOS client couldn't decide which peer to connect to and was constantly switching routes between the 2 peers inside the routing group for a given set. This was not consistent, but it did happen every now and then when connecting to NetBird. When this happened, due to the constant switching, I wasn't able to access the underlying resources properly because more than half of the requests were failing and only a few went through. The logs just kept on going until I disconnected my macOS client and then reconnected (though from time to time the issue reoccurred). |
Yes, I understand the problem clearly since I have the same issue within my own AWS environment. In my own environment I have noticed that the management endpoint will send route updates whenever a new client connects. When this occurs So my patch ignores all routing updates that are not actually unique. I am not sure whether this will be the be-all and end-all solution; I only propose this to give the NB devs inspiration.
So by ensuring |
The other potential solution I have been thinking about is adding an algo to how latency is updated, somehow debounce latency updates that do fit within the expected standard deviation or reasonable jitter |
Oh, and I have tested this patch for 48 hrs and sure enough my clients no longer jump from peer to peer, but I don't always end up with the routing peer that is actually closest to me. I think this is because of how macOS devices wake from sleep. |
Thanks for the explanation @hurricanehrndz, this is really cool and I fully understand what you are describing.
Yeah, this is probably exactly what is happening, since all of my users (around 30) are connecting to all of those sets of routing peers. Right now I have disabled HA by provisioning just 1 peer in each set of peers, and the issue does not occur anymore, which makes sense. Since you have already given me such great input @hurricanehrndz, may I ask one more question? As described in: #2150 (comment) I was wondering whether moving the EKS routing peers into public subnets rather than private ones would improve the latency and make the connections more stable, since, if I understand the routing process correctly, those should become direct/direct rather than relay/relay like they are now. There shouldn't be any security implications when provisioning those routing peers in public subnets, because all of the security groups attached to the public EKS nodes would be locked down anyway, since NetBird uses the TCP/UDP hole punching mechanism, so no ports need to be opened. Am I right in thinking the above? Alternatively, I have considered a setup where I would have only 1 set of routing peers in an EKS cluster, running in subnets that are part of a hub-and-spoke transit gateway routing setup, so that all clients would connect just to those peers, and those peers would then route the traffic to the different VPCs via a Transit Gateway attachment. I am a little reluctant about this idea, because if I do the routing to different VPCs on the Transit Gateway side, I lose the lovely NetBird access control policies and the granular permissions I am able to design with this feature. Would love your feedback on the above, and thanks very much for the info already provided! |
I am going to reply in your issue, that way this thread is reserved for the PR |
Hi @hurricanehrndz, While this is great for reducing the waste of resources when recalculating the same route I do have trouble understanding how this is fixing potential route flapping as I would expect flapping to be caused by the |
@pascal-fischer you are completely right, it will not resolve flapping. I had originally interpreted this user's logs incorrectly, which led me to believe that his issue #2150 was due to latency. After re-reading his logs I believe the issue is related to one of the peers making a direct connection via the other peer's tunnel. The user should be able to fix that by blocking the wg port from the private address space, or by ensuring that the routing peers can't reach each other, since they are both in the same private address space. |
Ah ok, I was confused by this comment
But the PR looks great. Will leave it like it is. |
For my own devices, which have different logs than @LeszekBlazewski's, it does solve the problem. |
Sorry for the confusion in this PR. @hurricanehrndz did a great job explaining everything in: #2150 (comment). |
Describe your changes
Compare the route update with the current routes; if the routes aren't different, skip recalculating them. As always, please feel free to edit as you see fit.
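As a rough illustration of the idea (not the code in this PR), the check could look like the Go sketch below. The `route` and `watcher` types and the `sameRoutes` and `applyUpdate` functions are hypothetical, simplified stand-ins; the real NetBird route type carries more fields.

```go
package main

import "fmt"

// route is a hypothetical, simplified route record.
type route struct {
	ID      string
	Network string
	Peer    string
	Metric  int
}

// watcher holds the routes currently in effect, keyed by route ID.
type watcher struct {
	routes       map[string]route
	recalculated int // counts how often a recalculation actually ran
}

// sameRoutes reports whether both maps hold exactly the same routes.
func sameRoutes(a, b map[string]route) bool {
	if len(a) != len(b) {
		return false
	}
	for id, ra := range a {
		if rb, ok := b[id]; !ok || ra != rb {
			return false
		}
	}
	return true
}

// applyUpdate ignores updates identical to the current routes and
// reports whether a recalculation ran.
func (w *watcher) applyUpdate(update map[string]route) bool {
	if sameRoutes(w.routes, update) {
		return false // non-unique update: nothing to do
	}
	w.routes = update
	w.recalculated++ // stand-in for the expensive route recalculation
	return true
}

func main() {
	w := &watcher{}
	u := map[string]route{"r1": {ID: "r1", Network: "10.0.0.0/16", Peer: "peer-a", Metric: 100}}
	fmt.Println(w.applyUpdate(u)) // first update is applied
	fmt.Println(w.applyUpdate(u)) // identical update is skipped
}
```

Comparing by value rather than by map identity matters here, since each update from the management side would typically arrive as a freshly built collection.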
Issue ticket number and link
Checklist