[Segment Replication] Update IndexShard recovery path to work with Segment Replication. #3483

Closed · Tracked by #3969
mch2 opened this issue May 31, 2022 · 8 comments

@mch2
Member

mch2 commented May 31, 2022

Replicas currently use peer recovery when starting up, but this has limitations with segment replication.

Peer recovery goes through the following steps when bringing up replicas (a simplified sketch follows the list).

  1. Attempt to copy segments. We can flag this for segrep to always do a file-based recovery.
  2. Mark the replica as initialized on the primary. At this point the replica will start receiving new operations and writing them to its xlog.
  3. Copy all operations from the starting seqNo (which is the seqNo of the latest operation in the replica's xlog) from the primary to the replica and index them.
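
For reference, a highly simplified outline of that ordering (loosely modeled on RecoverySourceHandler; the class and stub methods below are illustrative placeholders, not the real implementation):

    // Illustrative outline only -- the real logic lives in RecoverySourceHandler
    // and is far more involved. The stubs just capture the ordering of the
    // three steps above.
    class PeerRecoveryOutline {
        void recoverToReplica(long startingSeqNo) {
            copySegmentFiles();              // 1. file-based copy of segments
            markReplicaInitialized();        // 2. replica starts receiving new ops into its xlog
            replayOperations(startingSeqNo); // 3. replay history from startingSeqNo and index it
        }

        private void copySegmentFiles() { /* placeholder */ }
        private void markReplicaInitialized() { /* placeholder */ }
        private void replayOperations(long startingSeqNo) { /* placeholder */ }
    }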

Limitations with segrep:

  1. With NRTReplicationEngine on replicas, step 3 would only write those ops to the xlog. This means that while all ops are durable on the replica, the replica needs another replication event to be brought up to date.
  2. Replicas will create their own translog with a UUID that differs from what is stored inside the commit user data. I think this will cause problems when a replica is promoted to primary.
  3. Peer recovery only copies node-to-node; recovering from a remote store will require additional work.

The POC handles this by starting the replica with an empty translog, initializing its segrep-specific engine, and then pinging the primary to be tracked. It then initiates a copy event and starts with the latest set of segments. These POC steps leave a gap: replicas will miss operations indexed on the primary during recovery.

This issue should introduce passing integration tests. The tests from the POC branch can be reused.
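
A rough sketch of the shape such an integration test could take (the class and test names are hypothetical, and "index.replication.type" is assumed to be the index-level setting that enables segment replication; this is not the actual POC test):

    import org.opensearch.cluster.metadata.IndexMetadata;
    import org.opensearch.common.settings.Settings;
    import org.opensearch.test.OpenSearchIntegTestCase;

    import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertAcked;
    import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertHitCount;

    // Hypothetical test sketch -- names and setup are illustrative only.
    @OpenSearchIntegTestCase.ClusterScope(numDataNodes = 0)
    public class SegmentReplicationRecoveryIT extends OpenSearchIntegTestCase {

        public void testReplicaRecoversAndServesDocs() throws Exception {
            internalCluster().startNode();
            assertAcked(
                prepareCreate("test").setSettings(
                    Settings.builder()
                        .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
                        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0)
                        .put("index.replication.type", "SEGMENT") // assumed segrep index setting
                )
            );
            for (int i = 0; i < 100; i++) {
                client().prepareIndex("test").setId(Integer.toString(i)).setSource("field", "value" + i).get();
            }
            refresh("test");

            // Bring up a replica on a second node so it goes through the new recovery path.
            internalCluster().startNode();
            assertAcked(
                client().admin()
                    .indices()
                    .prepareUpdateSettings("test")
                    .setSettings(Settings.builder().put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1))
            );
            ensureGreen("test");

            // All docs indexed before/through recovery should be searchable.
            assertHitCount(client().prepareSearch("test").setSize(0).get(), 100);
        }
    }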

@mch2
Member Author

mch2 commented Jun 29, 2022

Updated description here with more detail.

I'm thinking that if we avoid using peer recovery, it would look something like this (rough sketch after the list):

  1. In SegmentReplicationTargetService we add a new recoverShard method.
  2. Replica fetches the translog uuid from the primary.
  3. Replica calls store.createEmpty, creates its xlog using the uuid from step 2, and starts its engine. Creating the empty store will make a commit, writing a segments_N to disk.
  4. After the engine is started, wipe its index with Lucene.cleanLuceneIndex.
  5. Start the replication process. During the first step, when fetching the checkpoint metadata from the primary, we initiate tracking so that the replica starts receiving ops and writing them to its xlog, and then commit the primary so all ops up to that point will be sent in the replication event. We need a commit here over a refresh so that those ops are durably persisted: if we only refresh, the replica will still receive those ops at the refresh point via the in-memory SegmentInfos, but the latest on-disk segments_N will not contain them. This extra commit is only required when replicating from the primary; remote store implementations can skip it. (Edited: we don't actually need a commit, a refresh will be enough. On the replica we can generate a changes snapshot from the in-memory infos and replay the ops directly into its xlog.)
  6. Finish replication and start the shard.
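
A very rough sketch of what such a recoverShard path might look like, as a fragment that assumes OpenSearch internals on the classpath (recoverShard, fetchTranslogUUIDFromPrimary, startTrackingOnPrimary, and startReplication are hypothetical names; Store.createEmpty, Translog.createEmptyTranslog, Lucene.cleanLuceneIndex, and the IndexShard calls are existing utilities, though their exact signatures may differ):

    // Hypothetical sketch of a segrep-specific recovery path in
    // SegmentReplicationTargetService. Placeholder helpers are marked as such.
    void recoverShard(IndexShard indexShard) throws IOException {
        // (2) ask the primary for its translog UUID (placeholder helper)
        final String translogUUID = fetchTranslogUUIDFromPrimary(indexShard.shardId());

        // (3) bootstrap an empty store + xlog with the primary's UUID, then start the engine
        indexShard.store().createEmpty(Version.LATEST);
        Translog.createEmptyTranslog(
            indexShard.shardPath().resolveTranslog(),
            indexShard.shardId(),
            SequenceNumbers.UNASSIGNED_SEQ_NO, // global checkpoint not known yet
            indexShard.getPendingPrimaryTerm(),
            translogUUID,
            FileChannel::open
        );
        indexShard.openEngineAndRecoverFromTranslog();

        // (4) wipe the bootstrap commit so the first copied checkpoint replaces it cleanly
        Lucene.cleanLuceneIndex(indexShard.store().directory());

        // (5) initiate tracking on the primary, then run one replication cycle so the
        //     replica starts from the primary's latest refresh/commit point (placeholders)
        startTrackingOnPrimary(indexShard);
        startReplication(indexShard);

        // (6) finish recovery and start the shard
        indexShard.finalizeRecovery();
        indexShard.postRecovery("segment replication recovery");
    }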

cc @kartg, @nknize, @Bukhtawar, curious what you all think here.

@kartg
Member

kartg commented Jun 29, 2022

Posting questions below to help me understand this better:

Replicas will create their own translog with a UUID that differs from what is stored inside commit user data. I think this will cause problems when a replica is promoted to primary.

Would this mismatch become a problem even earlier? As the replica performs segment replication, it will copy over the primary's translog UUID (as part of the segment user data), which would mismatch with the replica's own translog UUID.

Replica calls store.createEmpty, creates its xlog using the uuid from step 2, and starts its engine. Creating the empty store will make a commit, writing a segments_N to disk.

If the replica has existing segment files and/or a translog, would these be ignored / overwritten?

...and then commit the primary so all ops up to that point will be sent in the replication event. We need a commit here over a refresh so that those ops are durably persisted. ....

To confirm my understanding:

  1. This is necessary because the Engine must be started off a commit point and can't simply rely on in-memory segmentInfos. Correct?
  2. Since we're making a commit here, there's no need to replay translog operations to the replica. Correct?

finish replication & start the shard.

To confirm my understanding:

  1. Assuming the replication group hasn't received any further write operations, then the primary and replica are guaranteed to be on the same commit point, and will have empty translogs. Correct?
  2. On the other hand, if there is a steady stream of write operations, then the replica may still be behind the primary until the next refresh occurs. Both primary and replica are still guaranteed to be on the same commit point, and will have in-sync, non-empty translogs (since tracking has been initiated). However, the operations stored in the replica's translog will not be reflected in its segment files until the next refresh. Correct?

@mch2
Member Author

mch2 commented Jun 29, 2022

Posting questions below to help me understand this better:

Replicas will create their own translog with a UUID that differs from what is stored inside commit user data. I think this will cause problems when a replica is promoted to primary.

Would this mismatch become a problem even earlier? As the replica performs segment replication, it will copy over the primary's translog UUID (as part of the segment user data), which would mismatch with the replica's own translog UUID.

It would actually cause problems on restarts as well. The engine reads the last commit, fetches the uuid from its user data, then opens the xlog with that uuid. This call validates the uuid passed against what is stored in the pre-existing xlog and throws if there is a mismatch.
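
Roughly where that surfaces, as a small sketch against the existing Store/Translog plumbing (the surrounding engine code is elided):

    // Sketch: on engine start the translog UUID is read from the last commit's
    // user data and used to open the existing translog. The translog code
    // validates that UUID against what its own files record and throws
    // (TranslogCorruptedException) on a mismatch.
    final SegmentInfos lastCommit = store.readLastCommittedSegmentsInfo();
    final String translogUUIDFromCommit = lastCommit.getUserData().get(Translog.TRANSLOG_UUID_KEY);
    // If the replica created its xlog with its own UUID, opening the translog
    // with translogUUIDFromCommit fails here.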

Replica calls store.createEmpty, creates its xlog using the uuid from step 2, and starts its engine. Creating the empty store will make a commit, writing a segments_N to disk.

If the replica has existing segment files and/or a translog, would these be ignored / overwritten?

This step would be skipped; we would open the engine with what is already on disk.

...and then commit the primary so all ops up to that point will be sent in the replication event. We need a commit here over a refresh so that those ops are durably persisted. ....

To confirm my understanding:

  1. This is necessary because the Engine must be started off a commit point and can't simply rely on in-memory segmentInfos. Correct?

It's possible to start the engine from an in-memory infos, but we lose durability because we can't flush that infos on the replica unless it is promoted to primary. If the primary were to fail, the replica would only hold those ops in memory.

Actually, the more I think about this: we could create a LuceneChangesSnapshot on the replica after the infos is copied, then replay it locally to write those ops into its xlog. Then we don't actually need a commit on the primary, only a refresh.
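
Something like the following, as a rough sketch (how the snapshot is obtained from the copied infos is hand-waved behind a placeholder helper, and fromSeqNo/toSeqNo stand for the missing seqNo range; Translog.Snapshot, Translog.Operation, and Translog.add are existing types):

    // Sketch: after the copied SegmentInfos is applied on the replica, read the
    // ops in the missing seqNo range back out of its own index and append them
    // to its translog, instead of requiring a commit on the primary.
    try (Translog.Snapshot snapshot = newChangesSnapshotFromInfos(fromSeqNo, toSeqNo)) { // placeholder helper
        Translog.Operation op;
        while ((op = snapshot.next()) != null) {
            translog.add(op); // the ops become durable in the replica's xlog
        }
    }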

  2. Since we're making a commit here, there's no need to replay translog operations to the replica. Correct?

Correct. Because we are initiating tracking before committing (or now refreshing), all ops up to the point of initiating tracking will be sent to replicas with the file copy.

finish replication & start the shard.

To confirm my understanding:

  1. Assuming the replication group hasn't received any further write operations, then the primary and replica are guaranteed to be on the same commit point, and will have empty translogs. Correct?

Correct.

  2. On the other hand, if there is a steady stream of write operations, then the replica may still be behind the primary until the next refresh occurs. Both primary and replica are still guaranteed to be on the same commit point, and will have in-sync, non-empty translogs (since tracking has been initiated). However, the operations stored in the replica's translog will not be reflected in its segment files until the next refresh. Correct?

Correct, the replica would only catch up after the next primary refresh.

@mch2
Member Author

mch2 commented Jun 29, 2022

Threw together a draft to see how this works. Looks like it does, but it needs more tests.

@mch2 mch2 self-assigned this Jul 13, 2022
@Bukhtawar
Collaborator

Bukhtawar commented Jul 18, 2022

  1. So when a recovery begins on the new shard, can it start by fetching the latest commit point on the primary (or maybe creating one)? The user data from the primary's commit has the translogUUID.
  2. Copy the segments into the new shard from the commit in (1).
  3. From the seqNo in that last commit, start a PRRL on the primary.
  4. Copy a LuceneChangesSnapshot and write it to the new shard's txlog.
  5. New write operations at a higher seqNo would eventually be caught up via (4) to make up for the gaps.
  6. As refreshes happen on the primary after the last commit, they can be published as checkpoints to the new shard.

@mch2
Member Author

mch2 commented Jul 18, 2022

I think we can actually reuse peer recovery here as it exists today, with some tweaks, and solve this. The xlog is not created until the cleanFiles step, which is after segment copy. We can read the uuid from the user data at that point and create the xlog with it:

                Translog.createEmptyTranslog(
                    indexShard.shardPath().resolveTranslog(),
                    indexShard.shardId(),
                    globalCheckpoint,
                    indexShard.getPendingPrimaryTerm(),
                    translogUUID,
                    FileChannel::open
                );
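
Here translogUUID would come from the just-copied segments' user data rather than being freshly generated, e.g. (a sketch using the existing Store and Translog constants):

    // Sketch: at the cleanFiles step the primary's segments are already on disk,
    // so the primary's translog UUID can be read from the commit user data and
    // passed to createEmptyTranslog above instead of a newly generated UUID.
    final String translogUUID = indexShard.store()
        .readLastCommittedSegmentsInfo()
        .getUserData()
        .get(Translog.TRANSLOG_UUID_KEY);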

So the only downside here is step 6. I'm not a fan of marking the shard as active & in sync while still waiting for the next primary refresh to catch up. I think I'd prefer forcing a segment copy event before we mark the shard active. We can wire in a listener that achieves this if segrep is enabled.
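
As a rough sketch of that listener idea (the startReplication hook on SegmentReplicationTargetService and the finalize/fail helpers are assumptions, not existing APIs):

    // Hypothetical sketch: when segrep is enabled, force one segment copy event
    // before recovery is finalized and the shard is marked active.
    if (segmentReplicationEnabled) {
        segmentReplicationTargetService.startReplication(indexShard, ActionListener.wrap(
            r -> finalizeAndStartShard(indexShard),  // placeholder for the normal finalize path
            e -> failRecovery(indexShard, e)         // placeholder failure handling
        ));
    } else {
        finalizeAndStartShard(indexShard);
    }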

In the worst case this may increase recovery times, but I think we can optimize the case of segment replication with remote store, where we don't have the same durability concerns.

@Bukhtawar
Collaborator

So the only downside here is step 6. I'm not a fan of marking the shard as active & in sync while still waiting for the next primary refresh to catch up. I think I'd prefer forcing a segment copy event before we mark the shard active. We can wire in a listener that achieves this if segrep is enabled.

I think we should define "in-sync" in this context. My guess is that if the shard can recover ALL operations independently from its local segment store + xlog, it should be marked as in-sync. Even in steady state, the replica might have operations (in the xlog) that are not yet indexed into Lucene; it always has to wait on the primary's refresh to get them into a searchable state.

Sorry, I couldn't catch the durability concern here. Are there operations not present in segments which are also missing in the txlog? My understanding is that we should have all operations in the txlog from the last commit point on disk.

@kartg kartg self-assigned this Jul 18, 2022
@mch2
Member Author

mch2 commented Jul 19, 2022

I think we should define "in-sync" in this context. My guess is that if the shard can recover ALL operations independently from its local segment store + xlog, it should be marked as in-sync. Even in steady state, the replica might have operations (in the xlog) that are not yet indexed into Lucene; it always has to wait on the primary's refresh to get them into a searchable state.

Hmm, that's true. I was thinking that, at least after a recovery, that set of docs should also be searchable. I'm also worried about the refresh taking a while after the initial recovery. I'll punt on this for now until we have benchmarks; if we need to implement it, it's fairly straightforward to do from a listener.

Sorry, I couldn't catch the durability concern here. Are there operations not present in segments which are also missing in the txlog? My understanding is that we should have all operations in the txlog from the last commit point on disk.

I meant that with remote store we don't have the durability concern we do today. With that, we don't need to send ops during recovery only to store them in the xlog, and can simply copy from the latest refresh point.
