[Segment Replication] Update IndexShard recovery path to work with Segment Replication. #3483

Closed · Tracked by #3969
mch2 opened this issue May 31, 2022 · 8 comments

@mch2
Member

mch2 commented May 31, 2022

Replicas currently use peer recovery when starting up, but this has limitations with segment replication.

Peer recovery goes through the following steps when bringing up replicas (a simplified sketch follows the list).

  1. Attempt to copy segments. We can flag this for segrep to always do a file-based recovery.
  2. Mark the replica as initialized on the primary. At this point the replica will start receiving new operations and writing them to its xlog.
  3. Copy all operations from the starting seqNo (which is the seqNo of the latest operation in the replica's xlog) from the primary to the replica and index them.
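
For reference, a highly simplified outline of that ordering (loosely modeled on RecoverySourceHandler; the class and stub methods below are illustrative placeholders, not the real implementation):

    // Illustrative outline only -- the real logic lives in RecoverySourceHandler
    // and is far more involved. The stubs just capture the ordering of the
    // three steps above.
    class PeerRecoveryOutline {
        void recoverToReplica(long startingSeqNo) {
            copySegmentFiles();              // 1. file-based copy of segments
            markReplicaInitialized();        // 2. replica starts receiving new ops into its xlog
            replayOperations(startingSeqNo); // 3. replay history from startingSeqNo and index it
        }

        private void copySegmentFiles() { /* placeholder */ }
        private void markReplicaInitialized() { /* placeholder */ }
        private void replayOperations(long startingSeqNo) { /* placeholder */ }
    }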

Limitations with segrep:

  1. With NRTReplicationEngine on replicas, step 3 would only write those ops to the xlog. This means that while all ops are durable on the replica, the replica needs another replication event to be brought up to date.
  2. Replicas will create their own translog with a UUID that differs from what is stored inside the commit user data. I think this will cause problems when a replica is promoted to primary.
  3. Peer recovery only copies node-to-node; recovering from a remote store will require additional work.

The POC handles this by starting the replica with an empty translog, initializing its segrep-specific engine, and then pinging the primary to be tracked. It then initiates a copy event and starts with the latest set of segments. These POC steps leave a gap: replicas will miss operations indexed on the primary during recovery.

This issue should introduce passing integration tests. The tests from the POC branch can be reused.
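
A rough sketch of the shape such an integration test could take (the class and test names are hypothetical, and "index.replication.type" is assumed to be the index-level setting that enables segment replication; this is not the actual POC test):

    import org.opensearch.cluster.metadata.IndexMetadata;
    import org.opensearch.common.settings.Settings;
    import org.opensearch.test.OpenSearchIntegTestCase;

    import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertAcked;
    import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertHitCount;

    // Hypothetical test sketch -- names and setup are illustrative only.
    @OpenSearchIntegTestCase.ClusterScope(numDataNodes = 0)
    public class SegmentReplicationRecoveryIT extends OpenSearchIntegTestCase {

        public void testReplicaRecoversAndServesDocs() throws Exception {
            internalCluster().startNode();
            assertAcked(
                prepareCreate("test").setSettings(
                    Settings.builder()
                        .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
                        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0)
                        .put("index.replication.type", "SEGMENT") // assumed segrep index setting
                )
            );
            for (int i = 0; i < 100; i++) {
                client().prepareIndex("test").setId(Integer.toString(i)).setSource("field", "value" + i).get();
            }
            refresh("test");

            // Bring up a replica on a second node so it goes through the new recovery path.
            internalCluster().startNode();
            assertAcked(
                client().admin()
                    .indices()
                    .prepareUpdateSettings("test")
                    .setSettings(Settings.builder().put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1))
            );
            ensureGreen("test");

            // All docs indexed before/through recovery should be searchable.
            assertHitCount(client().prepareSearch("test").setSize(0).get(), 100);
        }
    }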

@mch2
Member Author

mch2 commented Jun 29, 2022

Updated description here with more detail.

I'm thinking that if we avoid using peer recovery, it would look something like this (rough sketch after the list):

  1. In SegmentReplicationTargetService we add a new recoverShard method.
  2. Replica fetches the translog uuid from the primary.
  3. Replica calls store.createEmpty, creates its xlog using the uuid from step 2, and starts its engine. Creating the empty store will make a commit, writing a segments_N to disk.
  4. After the engine is started, wipe its index with Lucene.cleanLuceneIndex.
  5. Start the replication process. During the first step, when fetching the checkpoint metadata from the primary, we initiate tracking so that the replica starts receiving ops and writing them to its xlog, and then commit the primary so all ops up to that point will be sent in the replication event. We need a commit here over a refresh so that those ops are durably persisted: if we only refresh, the replica will still receive those ops at the refresh point via the in-memory SegmentInfos, but the latest on-disk segments_N will not contain them. This extra commit is only required when replicating from the primary; remote store implementations can skip it. (Edited: we don't actually need a commit, a refresh will be enough. On the replica we can generate a changes snapshot from the in-memory infos and replay the ops directly into its xlog.)
  6. Finish replication and start the shard.
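
A very rough sketch of what such a recoverShard path might look like, as a fragment that assumes OpenSearch internals on the classpath (recoverShard, fetchTranslogUUIDFromPrimary, startTrackingOnPrimary, and startReplication are hypothetical names; Store.createEmpty, Translog.createEmptyTranslog, Lucene.cleanLuceneIndex, and the IndexShard calls are existing utilities, though their exact signatures may differ):

    // Hypothetical sketch of a segrep-specific recovery path in
    // SegmentReplicationTargetService. Placeholder helpers are marked as such.
    void recoverShard(IndexShard indexShard) throws IOException {
        // (2) ask the primary for its translog UUID (placeholder helper)
        final String translogUUID = fetchTranslogUUIDFromPrimary(indexShard.shardId());

        // (3) bootstrap an empty store + xlog with the primary's UUID, then start the engine
        indexShard.store().createEmpty(Version.LATEST);
        Translog.createEmptyTranslog(
            indexShard.shardPath().resolveTranslog(),
            indexShard.shardId(),
            SequenceNumbers.UNASSIGNED_SEQ_NO, // global checkpoint not known yet
            indexShard.getPendingPrimaryTerm(),
            translogUUID,
            FileChannel::open
        );
        indexShard.openEngineAndRecoverFromTranslog();

        // (4) wipe the bootstrap commit so the first copied checkpoint replaces it cleanly
        Lucene.cleanLuceneIndex(indexShard.store().directory());

        // (5) initiate tracking on the primary, then run one replication cycle so the
        //     replica starts from the primary's latest refresh/commit point (placeholders)
        startTrackingOnPrimary(indexShard);
        startReplication(indexShard);

        // (6) finish recovery and start the shard
        indexShard.finalizeRecovery();
        indexShard.postRecovery("segment replication recovery");
    }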

cc @kartg, @nknize, @Bukhtawar, curious what you all think here.

@kartg
Member

kartg commented Jun 29, 2022

Posting questions below to help me understand this better:

Replicas will create their own translog with a UUID that differs from what is stored inside commit user data. I think this will cause problems when a replica is promoted to primary.

Would this mismatch become a problem even earlier? As the replica performs segment replication, it will copy over the primary's translog UUID (as part of the segment user data), which would mismatch with the replica's own translog UUID.

Replica calls store.createEmpty, creates its xlog using the uuid from step 2, and starts its engine. Creating the empty store will make a commit, writing a segments_N to disk.

If the replica has existing segment files and/or a translog, would these be ignored / overwritten?

...and then commit the primary so all ops up to that point will be sent in the replication event. We need a commit here over a refresh so that those ops are durably persisted. ....

To confirm my understanding:

  1. This is necessary because the Engine must be started off a commit point and can't simply rely on in-memory segmentInfos. Correct?
  2. Since we're making a commit here, there's no need to replay translog operations to the replica. Correct?

finish replication & start the shard.

To confirm my understanding:

  1. Assuming the replication group hasn't received any further write operations, then the primary and replica are guaranteed to be on the same commit point, and will have empty translogs. Correct?
  2. On the other hand, if there is a steady stream of write operations, then the replica may still be behind the primary until the next refresh occurs. Both primary and replica are still guaranteed to be on the same commit point, and will have in-sync, non-empty translogs (since tracking has been initiated). However, the operations stored in the replica's translog will not be reflected in its segment files until the next refresh. Correct?

@mch2
Member Author

mch2 commented Jun 29, 2022

Posting questions below to help me understand this better:

Replicas will create their own translog with a UUID that differs from what is stored inside commit user data. I think this will cause problems when a replica is promoted to primary.

Would this mismatch become a problem even earlier? As the replica performs segment replication, it will copy over the primary's translog UUID (as part of the segment user data), which would mismatch with the replica's own translog UUID.

It would actually cause problems on restarts as well. The engine reads the last commit, fetches the uuid from its user data, then opens the xlog with that uuid. This call validates the uuid passed against what is stored in the pre-existing xlog and throws if there is a mismatch.
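
Roughly where that surfaces, as a small sketch against the existing Store/Translog plumbing (the surrounding engine code is elided):

    // Sketch: on engine start the translog UUID is read from the last commit's
    // user data and used to open the existing translog. The translog code
    // validates that UUID against what its own files record and throws
    // (TranslogCorruptedException) on a mismatch.
    final SegmentInfos lastCommit = store.readLastCommittedSegmentsInfo();
    final String translogUUIDFromCommit = lastCommit.getUserData().get(Translog.TRANSLOG_UUID_KEY);
    // If the replica created its xlog with its own UUID, opening the translog
    // with translogUUIDFromCommit fails here.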

Replica calls store.createEmpty, creates its xlog using the uuid from step 2, and starts its engine. Creating the empty store will make a commit, writing a segments_N to disk.

If the replica has existing segment files and/or a translog, would these be ignored / overwritten?

This step would be skipped; we would open the engine with what is already on disk.

...and then commit the primary so all ops up to that point will be sent in the replication event. We need a commit here over a refresh so that those ops are durably persisted. ....

To confirm my understanding:

  1. This is necessary because the Engine must be started off a commit point and can't simply rely on in-memory segmentInfos. Correct?

It's possible to start the engine from an in-memory infos, but we lose durability because we can't flush that infos on the replica unless it is promoted to primary. If the primary were to fail, the replica would only hold those ops in memory.

Actually, the more I think about this: we could create a LuceneChangesSnapshot on the replica after the infos is copied, then replay it locally to write those ops into its xlog. Then we don't actually need a commit on the primary, only a refresh.
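
Something like the following, as a rough sketch (how the snapshot is obtained from the copied infos is hand-waved behind a placeholder helper, and fromSeqNo/toSeqNo stand for the missing seqNo range; Translog.Snapshot, Translog.Operation, and Translog.add are existing types):

    // Sketch: after the copied SegmentInfos is applied on the replica, read the
    // ops in the missing seqNo range back out of its own index and append them
    // to its translog, instead of requiring a commit on the primary.
    try (Translog.Snapshot snapshot = newChangesSnapshotFromInfos(fromSeqNo, toSeqNo)) { // placeholder helper
        Translog.Operation op;
        while ((op = snapshot.next()) != null) {
            translog.add(op); // the ops become durable in the replica's xlog
        }
    }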

  2. Since we're making a commit here, there's no need to replay translog operations to the replica. Correct?

Correct. Because we are initiating tracking before committing (or now refreshing), all ops up to the point of initiating tracking will be sent to replicas with the file copy.

finish replication & start the shard.

To confirm my understanding:

  1. Assuming the replication group hasn't received any further write operations, then the primary and replica are guaranteed to be on the same commit point, and will have empty translogs. Correct?

Correct.

  2. On the other hand, if there is a steady stream of write operations, then the replica may still be behind the primary until the next refresh occurs. Both primary and replica are still guaranteed to be on the same commit point, and will have in-sync, non-empty translogs (since tracking has been initiated). However, the operations stored in the replica's translog will not be reflected in its segment files until the next refresh. Correct?

Correct, the replica would only catch up after the next primary refresh.

@mch2
Member Author

mch2 commented Jun 29, 2022

Threw together a draft to see how this works. Looks like it does, but it needs more tests.

@mch2 mch2 self-assigned this Jul 13, 2022
@Bukhtawar
Collaborator

Bukhtawar commented Jul 18, 2022

  1. So when a recovery begins on the new shard, can it start by fetching the latest commit point on the primary (or maybe creating one)? The user data from the primary's commit has the translogUUID.
  2. Copy the segments into the new shard from the commit in (1).
  3. From the seqNo in that last commit, start a PRRL on the primary.
  4. Copy a LuceneChangesSnapshot and write it to the new shard's txlog.
  5. New write operations at a higher seqNo would eventually be caught up via (4) to make up for the gaps.
  6. As refreshes happen on the primary after the last commit, they can be published as checkpoints to the new shard.

@mch2
Member Author

mch2 commented Jul 18, 2022

I think we can actually reuse peer recovery here as it exists today, with some tweaks, and solve this. The xlog is not created until the cleanFiles step, which is after segment copy. We can read the uuid from the user data at that point and create the xlog with it:

                Translog.createEmptyTranslog(
                    indexShard.shardPath().resolveTranslog(),
                    indexShard.shardId(),
                    globalCheckpoint,
                    indexShard.getPendingPrimaryTerm(),
                    translogUUID,
                    FileChannel::open
                );
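
Here translogUUID would come from the just-copied segments' user data rather than being freshly generated, e.g. (a sketch using the existing Store and Translog constants):

    // Sketch: at the cleanFiles step the primary's segments are already on disk,
    // so the primary's translog UUID can be read from the commit user data and
    // passed to createEmptyTranslog above instead of a newly generated UUID.
    final String translogUUID = indexShard.store()
        .readLastCommittedSegmentsInfo()
        .getUserData()
        .get(Translog.TRANSLOG_UUID_KEY);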

So the only downside here is step 6. I'm not a fan of marking the shard as active & in sync while still waiting for the next primary refresh to catch up. I think I'd prefer forcing a segment copy event before we mark the shard active. We can wire in a listener that achieves this if segrep is enabled.
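
As a rough sketch of that listener idea (the startReplication hook on SegmentReplicationTargetService and the finalize/fail helpers are assumptions, not existing APIs):

    // Hypothetical sketch: when segrep is enabled, force one segment copy event
    // before recovery is finalized and the shard is marked active.
    if (segmentReplicationEnabled) {
        segmentReplicationTargetService.startReplication(indexShard, ActionListener.wrap(
            r -> finalizeAndStartShard(indexShard),  // placeholder for the normal finalize path
            e -> failRecovery(indexShard, e)         // placeholder failure handling
        ));
    } else {
        finalizeAndStartShard(indexShard);
    }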

In the worst case this may increase recovery times, but I think we can optimize the case of segment replication with remote store, where we don't have the same durability concerns.

@Bukhtawar
Collaborator

So the only downside here is step 6. I'm not a fan of marking the shard as active & in sync while still waiting for the next primary refresh to catch up. I think I'd prefer forcing a segment copy event before we mark the shard active. We can wire in a listener that achieves this if segrep is enabled.

I think we should define "in-sync" in this context. My guess is that if the shard can recover ALL operations independently from its local segment store + xlog, it should be marked as in-sync. Even in steady state, the replica might have operations (in the xlog) that are not yet indexed into Lucene; it always has to wait on the primary's refresh to get them into a searchable state.

Sorry, I couldn't catch the durability concern here. Are there operations not present in segments which are also missing in the txlog? My understanding is that we should have all operations in the txlog from the last commit point on disk.

@kartg kartg self-assigned this Jul 18, 2022
@mch2
Member Author

mch2 commented Jul 19, 2022

I think we should define "in-sync" in this context. My guess is that if the shard can recover ALL operations independently from its local segment store + xlog, it should be marked as in-sync. Even in steady state, the replica might have operations (in the xlog) that are not yet indexed into Lucene; it always has to wait on the primary's refresh to get them into a searchable state.

Hmm, that's true. I was thinking that, at least after a recovery, that set of docs should also be searchable. I'm also worried about the refresh taking a while after the initial recovery. I'll punt on this for now until we have benchmarks; if we need to implement it, it's fairly straightforward to do from a listener.

Sorry, I couldn't catch the durability concern here. Are there operations not present in segments which are also missing in the txlog? My understanding is that we should have all operations in the txlog from the last commit point on disk.

I meant that with remote store we don't have the durability concern we do today. With that, we don't need to send ops during recovery only to store them in the xlog, and can simply copy from the latest refresh point.
