-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: Fix load reporting when pick first is used for locality-routing. #11495
Conversation
…mption that all addresses for a subchannel are in the same locality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've recommended a number of minor changes to comments
/** | ||
* An internal class. Do not use. | ||
* | ||
* <p>An interface to provide the address connected by subchannel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This says it provides the address
, but the method provides address attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the interface name to InternalSubchannelAddressAttributes
.
@Override | ||
public void shutdown() { | ||
if (localityStats != null) { | ||
localityStats.release(); | ||
if (localityAtomicReference.get() != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't possibly be null AFAICT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Removed the null
check.
// attributes with its locality, including endpoints in LOGICAL_DNS clusters. | ||
// In case of not (which really shouldn't), loads are aggregated under an empty | ||
// locality. | ||
if (locality == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be, but is possible that locality is set but locality name is null. You should probably have an else clause that does a null check on localityName.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is existing code, and it seems in general code assumes the values to be non-null. I'd much rather we handle that centrally in io.grpc.xds.client.Locality.create()
than lots of random not-possible-to-trigger checks. The data is primarily coming from a proto, so it will be ""
when unset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting to move the locality name null check to io.grpc.xds.client.Locality.create()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see now Larry's comment wasn't about the Locality struct, but instead the attribute. That had been discussed when the code was originally introduced:
#11133 (comment)
This PR is likely to be reverted (in some way) later, once the old PF policy goes away. So changes to the existing code are likely to be lost.
@@ -447,4 +480,33 @@ public void onLoadReport(MetricReport report) { | |||
stats.recordBackendLoadMetricStats(report.getNamedMetrics()); | |||
} | |||
} | |||
|
|||
/** | |||
* Represents the locality attributes of a cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actually the stats and the name. Name could be considered an attribute, the the stats aren't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. PTAL
@@ -91,7 +91,7 @@ private synchronized void releaseClusterDropCounter( | |||
String cluster, @Nullable String edsServiceName) { | |||
checkState(allDropStats.containsKey(cluster) | |||
&& allDropStats.get(cluster).containsKey(edsServiceName), | |||
"stats for cluster %s, edsServiceName %s not exits", cluster, edsServiceName); | |||
"stats for cluster %s, edsServiceName %s not exist", cluster, edsServiceName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/not exist/do not exist/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// attributes with its locality, including endpoints in LOGICAL_DNS clusters. | ||
// In case of not (which really shouldn't), loads are aggregated under an empty | ||
// locality. | ||
if (locality == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is existing code, and it seems in general code assumes the values to be non-null. I'd much rather we handle that centrally in io.grpc.xds.client.Locality.create()
than lots of random not-possible-to-trigger checks. The data is primarily coming from a proto, so it will be ""
when unset.
|
||
// This clusterLocality ideally should not be utilized. We derive locality | ||
// information from the first address, even prior to the subchannel becoming ready. | ||
// This is primarily for the purpose of facilitating load reporting in the pre-READY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no load reporting pre-READY. But an LB API could return the subchannel even before it is ready, and PF does this. We generally won't end up reporting load for such picks because the channel will ignore the selected (not-ready) subchannel, but we needed to handle the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, updated the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you miss pushing the update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, we both are able to see the updated comment.
private static final Attributes.Key<AtomicReference<ClusterLocality>> | ||
ATTR_CLUSTER_LOCALITY = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put ATTR_CLUSTER_LOCALITY on the same line as the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
This fixes Incorrect load reporting when using pick first as locality-routing policy.
Cause: ClusterImplLoadBalancer used to assume all the address are in the same locality and used first address in the list of addresses to determine locality. This is not always true leading to incorrect load reporting.
Fix: Get locality from address connected by the subchannel.
Closes #11434.