time sync woes, CockroachDB cluster failure following add/expunge testing #7534
Collecting sled identities

Sled expungements:
Cubby numbers:
(BRM23230010 is currently powered off and presumed to be sled 17)
Live sleds:
CockroachDB node identities:
History of sled 15 in particular:
History of sled 17 in particular:
Blueprint history
History of boundary NTP zones

Used this to see when boundary zones changed.
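A rough sketch of this kind of blueprint inspection with omdb, assuming the `omdb nexus blueprints list`/`show` subcommands and that boundary NTP zones are labelled `boundary_ntp` in the output (the actual query used may have differed):

```sh
# Sketch only: walk the blueprint history and note which blueprints contain
# boundary NTP zones. Assumes `omdb nexus blueprints list`/`show` subcommands
# and "boundary_ntp" labels in the output; the column parsing is approximate.
for bp in $(omdb nexus blueprints list | awk 'NR > 1 { print $1 }'); do
        echo "=== blueprint $bp ==="
        omdb nexus blueprints show "$bp" | grep -i boundary_ntp
done
```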
Results (assembled by hand from above):
Initial: fd00:1122:3344:102::10, fd00:1122:3344:101::10 (sleds 15, 14)

Sled 16 (internal) NTP tracking log

This is a sled that never hosted a boundary NTP zone.
Sled 14 (boundary) NTP tracking info

This is the sled that lost external network connectivity but does not appear to have drifted much from real time. This is the log after restoring external connectivity by hand using:
Based on this, it looks like the route was probably added around 2025-02-12T22:45Z. Sled 14’s first reported data point shows an offset of -2.515e-02, or -25ms. This is not nearly enough to explain the CockroachDB observations about the time delta.
Timeline of expungement and CockroachDB cluster status

(hand-assembled from raw data elsewhere in this issue)
About 18:06Z: we clear CockroachDB on sled 14. At this point, the other nodes (node 16) must have switched to tracking sled 14, so clocks are back in sync, it comes up, and we’re debugging.
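For reference, clearing the maintenance state is a one-liner run from the sled's global zone; the zone name (UUID elided) and service FMRI below are assumptions for illustration:

```sh
# Hypothetical example: clear the CockroachDB SMF service out of maintenance
# inside its zone. Zone name and FMRI are assumptions.
zlogin oxz_cockroachdb_<uuid> svcadm clear svc:/oxide/cockroachdb:default
```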
Investigating the NTP zone's lack of connectivity
At the time we started debugging, this boundary NTP zone's lack of connectivity was due to a missing OPTE route. The
Note the very high hit count on the output
We manually added this rule via

We haven't yet been able to reproduce this. Two theories from walking through the relevant code paths:
One possibility for 2: there are a handful of
If I hack up the
I have neither evidence this actually happened on dublin nor a hypothesis for why this query failed, but if it did, I think it would explain the situation we saw.
I have also saved the output of
This stems from part of #7534. In understanding that, we have a hypothetical failure mode where any lookup which fails due to a database failure (rather than a name lookup) could lead to gaps in routes, which would require a router bump to retry. This PR changes resolution of all items such that any unexpected DB failures (transient or otherwise) *will* cause the router's resolution to bail. It will take some time before it's recomputed, but we know it will be done.
After finishing the expungement of sled 17 yesterday, today we discovered the boundary NTP zone that had been placed on sled 15 was in the same state (i.e., missing the default IPv4 route in opte):
We confirmed that sled-agent claimed to have the most recent version of the system route (10):
We then started capturing sled-agent
and manually sent a
sled-agent logged that it thought it needed 8 routes, including
The next request from Nexus that re-sent version 10 also had 8 routes, including
and after sled-agent processed that request,
This seems to imply that the problem is on the sled-agent side: it knew there were supposed to be 8 routes (which it presumably got from Nexus), but only 7 were present. Manually deleting those routes allowed it to process the existing set from Nexus again, which restored things to working order.
This actually means something different from what it did in early releases.
If a boundary server is advertising with stratum 10, it therefore means two things:
If it was never synchronised, then it will never advertise an authoritative time, which we did deliberately to avoid potential race conditions around sled boot with lack of upstream connectivity to the NTP zone. In my experience, the clocks across a fleet of gimlets running free are very consistent, but we should perhaps look at the units in Dublin.
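For context, the fallback behaviour described here is driven by chrony's `local` directive; a sketch of checking it inside the boundary NTP zone (the zone name and config path are assumptions, and the exact options in the shipped config may differ):

```sh
# Show the local-reference fallback configuration in the boundary NTP zone.
# Zone name (UUID elided) and config path are assumptions.
zlogin oxz_ntp_<uuid> grep '^local' /etc/inet/chrony.conf
# Expected to resemble something like:
#   local stratum 10 distance 0.4
# i.e. serve the local clock at stratum 10 once the root distance to the
# upstream sources exceeds 0.4 (the threshold discussed below).
```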
Looking at sled 17 in dublin, which appears to be in the same state again, and knowing what actions were performed, I suspect that the following sequence of events has occurred:
The configuration states that the server should revert to the local reference at stratum 10 once the root distance increases above 0.4. That distance is calculated as root_delay / 2 + root_dispersion.

Here is the current state, some 89 minutes after connectivity was lost:
Which means that the root distance is still nowhere near the 0.4 threshold.

I've placed the chrony logs, current state snapshot and core files from both chronyd processes in this zone at
Minor typo, but that should be an
which is still not close to 0.4, but just double-checking my understanding
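For quick checks, the root distance can be computed directly from `chronyc -c tracking`, using the same field positions (root delay in field 11, root dispersion in field 12) that the monitoring script further down in this issue relies on:

```sh
# Root distance = root_delay / 2 + root_dispersion, from the CSV output of
# `chronyc -c tracking` (fields 11 and 12).
chronyc -c tracking | awk -F, '{ printf "delay=%s disp=%s distance=%.5f\n", $11, $12, $11 / 2 + $12 }'
```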
I spent some time today looking into why the boundary ntp server on sled 17 was behaving as it was. Andy's comment here (#7534 (comment)) has a good summary of what he thinks happened. A couple of things threw me off initially while coming up to speed on this particular issue:
As pointed out in Andy's comment, the distance at which the boundary ntp server is configured to fall back to stratum 10 is 0.4. The distance has increased since his comment, but not by a lot: the root delay is still the same, but the dispersion is increasing:
That's a distance of ~0.0577, up from ~0.0259 several hours ago. Assuming the distance increases linearly (which would make sense if only the dispersion has changed in the past several hours), and if my back of the envelope math is correct, that would mean about 12 hours until we hit the threshold distance for the boundary server to raise its stratum value. For comparison, here is the other current (working) boundary ntp tracking info from sled 14, which we can compute has a distance of ~0.0136.
This leaves us with at least two questions to explore:
I spun up an OmniOS VM on colo to get some more data on how chrony behaves when it is:
Here is the starting stable state:
Running a simple monitoring script to track root delay, dispersion and distance:

```sh
#!/bin/ksh
# Poll chrony every 10 seconds and log root delay, root dispersion and the
# derived root distance (delay / 2 + dispersion).
while :; do
	out=`chronyc -c tracking`
	# CSV output: root delay is field 11, root dispersion is field 12.
	IFS=, set -- $out
	delay=${11}
	disp=${12}
	((distance = delay / 2 + disp))
	printf "%s %.5f %.5f -- %.5f\n" "`date`" "$delay" "$disp" "$distance"
	sleep 10
done
```

We're off to a good start, and then I remove connectivity by adding a blackhole route:
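On illumos, a blackhole route of this sort can be added (and later removed) along these lines; the upstream server address below is a placeholder, not necessarily the one actually used:

```sh
# Blackhole traffic to the upstream NTP server so chrony loses its source.
# 192.0.2.1 is a placeholder for the real upstream address.
route add -host 192.0.2.1 127.0.0.1 -blackhole
# ...and later, to restore connectivity:
route delete -host 192.0.2.1 127.0.0.1
```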
Some time later...
This is consistent with what we saw, in that the root distance is not increasing quickly and chrony is not going to fall back to its local stratum 10 reference.

At this point we've been down for over two hours.
Eventually that test system does fall back to stratum 10:
but this is after 60 hours.
This is obviously not terribly quick or desirable, especially if the other boundary time server is also serving stratum 2 but has an active upstream. So, how do we improve things here?
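One knob worth noting, purely as a sketch rather than a tested recommendation: the threshold at which chrony gives up on its upstreams and serves the local reference is the `distance` argument to the `local` directive discussed above, so lowering it would shorten this window:

```sh
# Sketch only: the fallback threshold lives on chrony's `local` directive.
# Config path and the alternative value shown are illustrative assumptions.
grep '^local' /etc/inet/chrony.conf
#   local stratum 10 distance 0.4    <- behaviour described above
#   local stratum 10 distance 0.1    <- example of a quicker fallback
```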
This stems from attempting to understand #7534/#7541. In reviewing that, we have a hypothetical failure mode where any lookup which fails due to a database failure (rather than a name lookup) could lead to gaps in routes, which would require a router bump to retry. This PR changes resolution of all items such that any unexpected DB failures (transient or otherwise) *will* cause the router's resolution to bail. It will take some time before it's recomputed, but we know it will be done.
I wanted to record here in GitHub the investigation of several related issues in the dublin test environment today. For folks within Oxide, this information came from this document but I'm intending to include all the relevant bits here.
Context:
Initial symptoms: after a few hours, the CockroachDB cluster was found offline:
CockroachDB on sled 14 is offline, with the SMF service in maintenance:
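The state itself is visible with standard SMF tooling from inside the zone, e.g. (zone name is a placeholder):

```sh
# Show which services are in maintenance and why, inside the CockroachDB zone.
# Zone name (UUID elided) is a placeholder.
zlogin oxz_cockroachdb_<uuid> svcs -xv
```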
It took some digging to find the real problem from the logs, but ultimately it was in /data/logs/cockroach.oxz_cockroachdb_46a676ec-b910-44d2-8a68-e3058b33e74e.root.2025-02-12T09_23_18Z.021171.log:
Here's a summary of what we believe happened. I'll put more details in separate comments (for others who helped with the investigation, please add your notes too!):
It was around this time we started debugging.
We used svcadm clear to bring up the CockroachDB node on sled 14, and it came up fine.

There remain a few open questions:
There are a few issues we probably want to file here: