[Segment Replication] Update IndexShard recovery path to work with Segment Replication. #3483
Updated description here with more detail. I'm thinking if we avoid using peer recovery it would look something like this:
cc @kartg, @nknize, @Bukhtawar, curious what you all think here.
Posting questions below to help me understand this better:
Would this mismatch become a problem even earlier? As the replica performs segment replication, it will copy over the primary's translog UUID (as part of the segment user data), which would mismatch with the replica's translog UUID.
If the replica has existing segment files and/or a translog, would these be ignored / overwritten?
To confirm my understanding:
It would actually cause problems on restarts as well. The engine reads the last commit, fetches the UUID, then opens the xlog with that UUID. This call validates the UUID passed against what is stored in a pre-existing xlog and throws if there is a mismatch.
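For reference, a minimal sketch of that lookup using plain Lucene APIs (the directory path argument and the literal "translog_uuid" key, which mirrors Translog.TRANSLOG_UUID_KEY, are assumptions for illustration, not code from this thread):

```java
import java.nio.file.Paths;
import java.util.Map;

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Reads the translog UUID recorded in the last commit's user data. On a segrep
// replica this UUID comes from the primary's copied segments, while any
// pre-existing local translog was created with the replica's own UUID, so the
// translog open that follows would detect the mismatch and fail.
public class LastCommitTranslogUuid {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            SegmentInfos lastCommit = SegmentInfos.readLatestCommit(dir);
            Map<String, String> userData = lastCommit.getUserData();
            System.out.println("translog UUID in last commit: " + userData.get("translog_uuid"));
        }
    }
}
```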
This step would be skipped; we would open the engine with what is already on disk.
It's possible to start the engine from an in-memory SegmentInfos, but we lose durability because we can't flush those infos on the replica unless it is promoted as a new primary. If the primary were to fail, the replica would only hold those ops in memory. Actually... the more I think on this. We could create a
Correct. Because we are initiating tracking before committing (or now refreshing), all ops up to the point of initiating tracking will be sent to replicas with the file copy.
Correct.
Correct, the replica would only catch up after the next primary refresh.
Threw together a draft to see how this works. Looks like it is working, but it needs more tests.
I think we can actually reuse PeerRecovery here as it exists today, with some tweaks, and solve this. The xlog is not created until the cleanFiles step, which is after segment copy. We can read the UUID at that point from the user data and create the xlog with it:

```java
Translog.createEmptyTranslog(
    indexShard.shardPath().resolveTranslog(),
    indexShard.shardId(),
    globalCheckpoint,
    indexShard.getPendingPrimaryTerm(),
    translogUUID,
    FileChannel::open
);
```

So the only downside here is step 6. I'm not a fan of marking the shard as active & in sync while still waiting for the next primary refresh to catch up. I think I'd prefer forcing a segment copy event before we mark the shard active. We can wire in a listener that achieves this if segrep is enabled. In the worst case this may increase recovery times, but I think we can optimize when using segment replication with remote store, where we don't have the same durability concerns.
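A hedged sketch of how that cleanFiles-time step could look end to end, combining the UUID read from the copied commit's user data with the createEmptyTranslog call above (the SegRepTranslogBootstrap helper name and the use of Store#readLastCommittedSegmentsInfo are assumptions for illustration, not code from this thread):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;

import org.opensearch.index.shard.IndexShard;
import org.opensearch.index.translog.Translog;

// Hypothetical helper: seed an empty local translog with the translog UUID found
// in the user data of the segments just copied from the primary, so the engine
// that opens afterwards sees matching UUIDs in the commit and the translog.
final class SegRepTranslogBootstrap {
    static void createTranslogFromCopiedCommit(IndexShard indexShard, long globalCheckpoint) throws IOException {
        String translogUUID = indexShard.store()
            .readLastCommittedSegmentsInfo()
            .getUserData()
            .get(Translog.TRANSLOG_UUID_KEY);
        Translog.createEmptyTranslog(
            indexShard.shardPath().resolveTranslog(),
            indexShard.shardId(),
            globalCheckpoint,
            indexShard.getPendingPrimaryTerm(),
            translogUUID,
            FileChannel::open
        );
    }
}
```

The rest of the peer recovery flow would stay as-is; only the source of the translog UUID changes.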
I think we should define "in-sync" in this context. My guess is that if the shard can recover ALL operations independently from its local segment store + xlog, it should be marked as in-sync. Even in steady state, the replica might not have operations (in the xlog) indexed into Lucene; it would always have to wait on the primary's refresh to get them into a searchable state. Sorry, I couldn't catch the durability concern here. Are there operations not present in segments which are also missing in the xlog? My understanding is that we should have all operations in the xlog from the last commit point on disk.
Hmm, that's true. I was thinking that at least after a recovery that set of docs should also be searchable. I'm also worried about the refresh taking a while after the initial recovery. Will punt on this for now until we have benchmarks; if we need to implement it, it's fairly straightforward to do from a listener.
I meant that with remote store we don't have the durability concern we do today. With that, we don't need to send ops during recovery only to store them in the xlog and can simply copy from the latest refresh point.
Replicas currently use peer recovery when starting up, but this has limitations with segment replication.
Peer recovery goes through the following steps when bringing up replicas.
Limitations with segrep:
The POC does this by starting the replica with an empty translog, initiating its segrep-specific engine, and then pinging the primary to be tracked. It will then initiate a copy event and start with the latest set of segments. The POC steps leave a gap in that replicas will miss operations indexed on the primary during recovery.
This issue should introduce passing integration tests. These tests can be reused from the POC branch.