
Ignore non unique route updates #2266

Merged

Conversation

hurricanehrndz
Contributor

@hurricanehrndz hurricanehrndz commented Jul 12, 2024

Describe your changes

Compare the route update with the current routes; if the routes aren't different, skip recalculating them. As always, please feel free to edit as you see fit.
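
A minimal sketch of the idea, with hypothetical type and function names rather than the actual NetBird internals:

	package main

	import "fmt"

	// Route is a simplified stand-in for a network route entry (hypothetical type).
	type Route struct {
		ID      string
		Network string
		Peer    string
	}

	// routesEqual reports whether two updates describe the same set of routes,
	// regardless of order.
	func routesEqual(a, b []Route) bool {
		if len(a) != len(b) {
			return false
		}
		seen := make(map[Route]int, len(a))
		for _, r := range a {
			seen[r]++
		}
		for _, r := range b {
			if seen[r] == 0 {
				return false
			}
			seen[r]--
		}
		return true
	}

	type manager struct {
		current []Route
	}

	// handleUpdate recalculates routes only when the update actually differs
	// from what is already applied.
	func (m *manager) handleUpdate(update []Route) {
		if routesEqual(m.current, update) {
			fmt.Println("update not unique, skipping route recalculation")
			return
		}
		m.current = update
		fmt.Println("routes changed, recalculating")
	}

	func main() {
		m := &manager{}
		routes := []Route{{ID: "r1", Network: "10.0.0.0/16", Peer: "peer-a"}}
		m.handleUpdate(routes) // routes changed, recalculating
		m.handleUpdate(routes) // update not unique, skipping route recalculation
	}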

Issue ticket number and link

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)
  • Extended the README / documentation, if necessary

@hurricanehrndz hurricanehrndz changed the title Ignore no unique route updates Ignore non unique route updates Jul 12, 2024
@LeszekBlazewski

Looking forward to this fix so that we can re-enable the HA feature in our setup!

I am not super familiar with Go and the netbird source code, so I am not able to provide proper feedback on whether the proposed solution is OK and covers all cases, but at first glance it seems reasonable.

Maybe adding a test case for the problem this solution fixes would ensure that this issue won't come up as a regression in the future?

Alternatively, rather than fixing this in the handleUpdate function as proposed in this PR, the getBestRouteFromStatuses algorithm could be reevaluated, where there is already:

		// we compare the current score + 10ms to the chosen score to avoid flapping between routes
		if currScore != 0 && currScore+0.01 > chosenScore {
			log.Debugf("Keeping current routing peer because the score difference with latency is less than 0.01(10ms), current: %f, new: %f", currScore, chosenScore)
			return currID
		}

That logic does not seem to fix this as of now, since in the linked issue's logs the routes are changing constantly.

Maybe some rounding should be added to the score calculation, along with a condition to not consider switching routes whose scores are only marginally different (as in the logs in the linked issue).
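
For illustration, a rough sketch of that rounding idea (a hypothetical helper, not existing NetBird code): bucket both scores to a coarser step before comparing, so marginally different scores never win.

	package main

	import (
		"fmt"
		"math"
	)

	// shouldSwitch buckets both scores to a coarse step before comparing,
	// so scores that differ only marginally are treated as equal.
	// A higher score is assumed to be better, matching the snippet above.
	func shouldSwitch(currScore, chosenScore float64) bool {
		const step = 0.01 // roughly the existing 10ms tolerance
		curr := math.Floor(currScore/step) * step
		chosen := math.Floor(chosenScore/step) * step
		return chosen > curr
	}

	func main() {
		fmt.Println(shouldSwitch(0.523, 0.529)) // false: within the same bucket
		fmt.Println(shouldSwitch(0.52, 0.58))   // true: clearly better score
	}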

@hurricanehrndz
Contributor Author

> Looking forward to this fix so that we can re-enable the HA feature in our setup!
>
> I am not super familiar with Go and the netbird source code, so I am not able to provide proper feedback on whether the proposed solution is OK and covers all cases, but at first glance it seems reasonable.
>
> Maybe adding a test case for the problem this solution fixes would ensure that this issue won't come up as a regression in the future?
>
> Alternatively, rather than fixing this in the handleUpdate function as proposed in this PR, the getBestRouteFromStatuses algorithm could be reevaluated, where there is already:
>
> 		// we compare the current score + 10ms to the chosen score to avoid flapping between routes
> 		if currScore != 0 && currScore+0.01 > chosenScore {
> 			log.Debugf("Keeping current routing peer because the score difference with latency is less than 0.01(10ms), current: %f, new: %f", currScore, chosenScore)
> 			return currID
> 		}
>
> That logic does not seem to fix this as of now, since in the linked issue's logs the routes are changing constantly.
>
> Maybe some rounding should be added to the score calculation, along with a condition to not consider switching routes whose scores are only marginally different (as in the logs in the linked issue).

I have tried playing with the threshold, and in all honesty it will decrease the number of times you hop from peer to peer, but it is not really a fix. I need to add a backoff routine to the statewatcher so as to avoid firing when connected before there is a latency value.
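
A rough sketch of the backoff I have in mind (hypothetical helper names and an assumed 5-second grace period): hold the notification until a latency sample exists, so route selection never runs against an empty latency.

	package main

	import (
		"fmt"
		"time"
	)

	// notifyWhenLatencyKnown waits (up to a grace period) for a latency sample
	// before firing the state-change notification.
	func notifyWhenLatencyKnown(hasLatency func() bool, notify func()) {
		deadline := time.Now().Add(5 * time.Second) // assumed grace period
		for !hasLatency() && time.Now().Before(deadline) {
			time.Sleep(200 * time.Millisecond)
		}
		notify()
	}

	func main() {
		checks := 0
		hasLatency := func() bool {
			checks++
			return checks > 3 // pretend the third probe finally has a latency value
		}
		notifyWhenLatencyKnown(hasLatency, func() {
			fmt.Println("state update fired only after latency was available")
		})
	}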

@hurricanehrndz
Contributor Author

hurricanehrndz commented Jul 13, 2024

Perhaps another addition would be to make the threshold before it jumps to another routing peer configurable, but any number of network conditions can affect latency negatively and force the clients to rebalance.

@hurricanehrndz
Contributor Author

@LeszekBlazewski what do you mean by the routes constantly changing? Is the routing peer having new network routes added via the API as new networks are provisioned in AWS?

@LeszekBlazewski

LeszekBlazewski commented Jul 14, 2024

> @LeszekBlazewski what do you mean by the routes constantly changing? Is the routing peer having new network routes added via the API as new networks are provisioned in AWS?

@hurricanehrndz No, the logs in the linked issue describe the following scenario:

There are 3 sets of HA routing peers (1 set for each Kubernetes cluster). Each set consists of 2 netbird clients running in Kubernetes pods as Docker containers, spread across 2 availability zones (so each netbird peer runs on a different EC2 instance in a different availability zone). Each set of netbird peers is responsible for routing traffic inside the VPC in which the Kubernetes cluster runs. This essentially means that each network route I have configured is HA, where the group that handles the traffic is one of the netbird client sets (so I have 3 HA routes in total). Additionally, we temporarily have an exit node, which is one of the routing sets. This can be observed in the screenshot in the comment here: #2150 (comment)

The above sets of netbird pods in the 3 different EKS clusters are not reprovisioned and nothing is changed. Simply, from time to time when I was trying to connect to netbird (from my macOS client), I observed a scenario where my macOS client couldn't decide which peer to connect to and was constantly switching routes between the 2 peers inside the routing group for a given set. This occurred only sometimes and was not consistent, but it did happen every now and then when connecting to netbird. When this happened, due to the constant switching, I wasn't able to access the underlying resources properly, because more than half of the requests were failing and only a few of them went through. The logs just kept on going until I disconnected my macOS client and then reconnected (from time to time the issue reoccurred, though).

@hurricanehrndz
Contributor Author

> > @LeszekBlazewski what do you mean by the routes constantly changing? Is the routing peer having new network routes added via the API as new networks are provisioned in AWS?
>
> @hurricanehrndz No, the logs in the linked issue describe the following scenario:
>
> There are 3 sets of HA routing peers (1 set for each Kubernetes cluster). Each set consists of 2 netbird clients running in Kubernetes pods as Docker containers, spread across 2 availability zones (so each netbird peer runs on a different EC2 instance in a different availability zone). Each set of netbird peers is responsible for routing traffic inside the VPC in which the Kubernetes cluster runs. This essentially means that each network route I have configured is HA, where the group that handles the traffic is one of the netbird client sets (so I have 3 HA routes in total). Additionally, we temporarily have an exit node, which is one of the routing sets. This can be observed in the screenshot in the comment here: #2150 (comment)
>
> The above sets of netbird pods in the 3 different EKS clusters are not reprovisioned and nothing is changed. Simply, from time to time when I was trying to connect to netbird (from my macOS client), I observed a scenario where my macOS client couldn't decide which peer to connect to and was constantly switching routes between the 2 peers inside the routing group for a given set. This occurred only sometimes and was not consistent, but it did happen every now and then when connecting to netbird. When this happened, due to the constant switching, I wasn't able to access the underlying resources properly, because more than half of the requests were failing and only a few of them went through. The logs just kept on going until I disconnected my macOS client and then reconnected (from time to time the issue reoccurred, though).

Yes, I understand the problem clearly since I have the same issue within my own AWS environment.

In my own environment I have noticed that the management endpoint will send route updates whenever a new client connects. When this occurs, [getBestRouteFromStatuses](https://github.com/netbirdio/netbird/blob/main/client/internal/routemanager/client.go#L103) is called, which makes no sense, since the routes haven't really changed (at least for me, due to the ACLs we have in the admin).

So my patch ignores all routing updates that are not actually unique.

I am not sure if in the end this will be the be-all and end-all solution; I only propose it in order to give the NB devs inspiration.

getBestRouteFromStatuses will stick to the same routing peer as long as it is not 10ms slower than a faster peer. Because your peers are jumping from peer to peer, it means the latency is being affected by the number of clients connected and/or the load. I think this is probably due to running on EKS with the underlying kernel not tuned for a high-connection routing load. So changing how the latency is compared will, I think, do very little in this situation.

So by ensuring getBestRouteFromStatuses is only called when necessary, it should force the macOS client to stick to a specific routing peer until a new routing peer is added to the HA group or until a routing peer is disconnected.

@hurricanehrndz
Contributor Author

hurricanehrndz commented Jul 14, 2024

The other potential solution I have been thinking about is adding an algorithm to how latency is updated: somehow debounce latency updates that fall within the expected standard deviation or reasonable jitter.
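
A rough sketch of that debouncing idea (hypothetical type; the jitter tolerance is an assumed fixed value rather than a measured standard deviation):

	package main

	import (
		"fmt"
		"time"
	)

	// latencyFilter drops latency updates that stay within the expected jitter
	// band around the last accepted sample.
	type latencyFilter struct {
		last   time.Duration
		jitter time.Duration // assumed tolerance; could be a rolling std deviation
	}

	// accept reports whether the new sample differs enough to be propagated.
	func (f *latencyFilter) accept(sample time.Duration) bool {
		diff := sample - f.last
		if diff < 0 {
			diff = -diff
		}
		if f.last != 0 && diff <= f.jitter {
			return false // within expected jitter, debounce the update
		}
		f.last = sample
		return true
	}

	func main() {
		f := &latencyFilter{jitter: 5 * time.Millisecond}
		for _, s := range []time.Duration{
			42 * time.Millisecond, // first sample, accepted
			44 * time.Millisecond, // within jitter, dropped
			60 * time.Millisecond, // meaningful change, accepted
		} {
			fmt.Printf("%v accepted=%v\n", s, f.accept(s))
		}
	}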

@hurricanehrndz
Contributor Author

hurricanehrndz commented Jul 14, 2024

Oh, and I have tested this patch for 48 hrs, and sure enough my clients no longer jump from peer to peer, but I don't always end up with the routing peer that is actually closest to me. I think this is because of how macOS devices wake from sleep.

@LeszekBlazewski

LeszekBlazewski commented Jul 14, 2024

Thanks for the explanation @hurricanehrndz, this is really cool and I fully understand what you are describing.

> Because your peers are jumping from peer to peer, it means the latency is being affected by the number of clients connected and/or the load. I think this is probably due to running on EKS with the underlying kernel not tuned for a high-connection routing load.

Yeah, this is probably exactly what is happening, since all of my users (around 30) are connecting to all of those sets of routing peers. Right now I have disabled HA by provisioning just 1 peer in each set of peers, and the issue does not occur anymore, which makes sense.

Since you have already given me such great input @hurricanehrndz, may I ask one more question? As described in #2150 (comment), I was wondering whether moving the EKS routing peers into public subnets, rather than private ones, would improve the latency and make the connections more stable, since if I understand the routing process correctly, those connections should become direct/direct rather than relay/relay like they are now. There shouldn't be any security implications when provisioning those routing peers in public subnets, because all of the security groups attached to the public EKS nodes would be locked down anyway, since netbird uses the TCP/UDP hole punching mechanism and so no ports must be opened. Am I right in thinking the above?

Alternatively, I have considered a setup where I would have only 1 set of routing peers in an EKS cluster, running in subnets that are part of a hub-and-spoke transit gateway routing setup, so that all clients would connect just to those peers, and those peers would then route the traffic to different VPCs via a Transit Gateway attachment. I am a little reluctant about this idea because if I do the routing to different VPCs on the Transit Gateway side, I lose the lovely netbird access control policies and granular permissions I am able to design with this feature.

Would love your feedback on the above, and thanks a lot for the info you have already provided!

@hurricanehrndz
Contributor Author

I am going to reply in your issue; that way this thread stays reserved for the PR.

@pascal-fischer
Contributor

Hi @hurricanehrndz,
thank you for your contribution. This will definitely help reduce unnecessary recalculation of the same routing map (we need to have a look to see if we can already fix this at the source, but that might require some bigger refactoring).

While this is great for reducing the waste of resources from recalculating the same routes, I do have trouble understanding how this fixes potential route flapping, as I would expect flapping to be caused by the peerStateUpdate channel in case of different scores due to latency or connection changes. Having the same routing map recalculated should result in the same chosen routes and therefore no flapping?

@hurricanehrndz
Contributor Author

hurricanehrndz commented Jul 15, 2024

@pascal-fischer you are completely right, it will not resolve flapping. I had originally interpreted this user's logs incorrectly, and it led me to believe that his issue #2150 was due to latency. After re-reading his logs, I believe the issue is related to one of the peers making a direct connection via the other peer's tunnel. The user should be able to fix that by blocking the WireGuard port from the private address space, or by ensuring that each routing peer can't reach the other, since they are both in the same private address space.

@pascal-fischer
Contributor

Ah ok, I was confused by this comment

> I have tested this patch for 48 hrs, and sure enough my clients no longer jump from peer to peer

But the PR looks great. Will leave it like it is.

@hurricanehrndz
Contributor Author

> Ah ok, I was confused by this comment
>
> > I have tested this patch for 48 hrs, and sure enough my clients no longer jump from peer to peer
>
> But the PR looks great. Will leave it like it is.

For my own devices, which have different logs than @LeszekBlazewski's, it does solve the problem.

@LeszekBlazewski

Sorry for the confusion in this PR. @hurricanehrndz did a great job explaining everything in #2150 (comment).

@mlsmaycon mlsmaycon merged commit 12ff93b into netbirdio:main Jul 16, 2024
21 checks passed
@hurricanehrndz hurricanehrndz deleted the norecalcuate_on_no_route_change branch July 17, 2024 15:36