
[RW Separation] Search replica recovery flow breaks when search shard allocated to new node after node drop #17334

Open
vinaykpud opened this issue Feb 12, 2025 · 6 comments · May be fixed by #17457
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance

Comments

vinaykpud (Contributor) commented Feb 12, 2025

Is your feature request related to a problem? Please describe

Context:
Created a 5-node cluster and an index with 1 primary (P) and 1 search replica (S).

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles      cluster_manager name
172.18.0.3           26          90   5    2.19    1.84     1.69 d         data            -               opensearch-node5
172.18.0.4           23          90   5    2.19    1.84     1.69 -         coordinating    -               opensearch-node1
172.18.0.2           27          90   6    2.19    1.84     1.69 d         data            -               opensearch-node3
172.18.0.5           23          90   5    2.19    1.84     1.69 d         data            -               opensearch-node4
172.18.0.6           30          90   5    2.19    1.84     1.69 m         cluster_manager *               opensearch-node2

The shard assignment was as follows:

index    shard prirep state   docs store ip         node
products 0     p      STARTED    0  230b 172.18.0.3 opensearch-node5
products 0     s      STARTED    0  230b 172.18.0.2 opensearch-node3

To simulate a node drop (the cluster runs locally in Docker), I stopped node3:

index    shard prirep state      docs store ip         node
products 0     p      STARTED       0  230b 172.18.0.3 opensearch-node5
products 0     s      UNASSIGNED

After 1 minute, the AllocationService tries to allocate the search shard to node4, and the allocation fails with the exception below:

2025-02-09 13:07:53 "stacktrace": ["org.opensearch.indices.recovery.RecoveryFailedException: [products3][0]: Recovery failed on {opensearch-node8}{eHuGysErRFuGUCFO2KxGuw}{Wby_7fTEToWk5bnavKBlbA}{172.18.0.9}{172.18.0.9:9300}{dimr}{zone=zone3, shard_indexing_pressure_enabled=true}",
2025-02-09 13:07:53 "at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$32(IndexShard.java:3902) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$10(StoreRecovery.java:618) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:347) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:123) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2919) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]",
2025-02-09 13:07:53 "at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]",
2025-02-09 13:07:53 "at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]",
2025-02-09 13:07:53 "Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed to fetch index version after copying it over",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:717) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more",
2025-02-09 13:07:53 "Caused by: org.opensearch.index.shard.IndexShardRecoveryException: shard allocated for local recovery (post api), should exist, but doesn't, current files: []",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:702) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more",
2025-02-09 13:07:53 "Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/usr/share/opensearch/data/nodes/0/indices/PFeBY9eRTaKcRhDKEM2WAQ/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@52539624)): files: []",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:808) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:764) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:542) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:526) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.opensearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:135) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.store.Store.readSegmentsInfo(Store.java:255) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:237) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:692) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more"] }

Describe the solution you'd like

This happens because of how the recovery source is set in ShardRouting:

When moveToUnassigned is called, we set the recoverySource to ExistingStoreRecoverySource for the search replica. Since this scenario involves recovering the shard on a different node, that node has no files in its local store to recover from, and recovery fails with the exception above. The proposed solution is to always use EmptyStoreRecoverySource for search replicas (see the sketch below).
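
For clarity, here is a minimal, self-contained model of the rule being proposed. It is not the actual ShardRouting/RecoverySource code; the class, enum, and method names are made up for illustration, and the PRIMARY/REPLICA branches simply mirror the usual existing-store/peer behavior.

```java
// Toy model of the proposed rule only -- NOT the actual OpenSearch ShardRouting /
// RecoverySource code. Names here are illustrative.
public class RecoverySourceChoice {

    enum ShardKind { PRIMARY, REPLICA, SEARCH_REPLICA }

    enum RecoverySourceType { EXISTING_STORE, PEER, EMPTY_STORE }

    // Proposed rule from this issue: when a shard moves to unassigned, a search
    // replica should not be pinned to EXISTING_STORE, because it may later be
    // allocated to a node that has no local copy of the shard files.
    static RecoverySourceType onMoveToUnassigned(ShardKind kind) {
        switch (kind) {
            case PRIMARY:
                return RecoverySourceType.EXISTING_STORE; // primary reuses its own local copy
            case SEARCH_REPLICA:
                return RecoverySourceType.EMPTY_STORE;    // proposed fix: start from an empty store
            case REPLICA:
            default:
                return RecoverySourceType.PEER;           // regular replica recovers from the primary
        }
    }

    public static void main(String[] args) {
        System.out.println(onMoveToUnassigned(ShardKind.SEARCH_REPLICA)); // EMPTY_STORE
        System.out.println(onMoveToUnassigned(ShardKind.PRIMARY));        // EXISTING_STORE
    }
}
```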

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

mch2 (Member) commented Feb 13, 2025

@vinaykpud This will impact cases where a search replica (SR) node is restarted. Can you check our recovery logic to see if that's the case? I.e., we still want to diff local segments and ensure we only fetch what's required to recover.

vinaykpud (Contributor, Author) commented:

Added an integ test reproducing this: d89c1cf

vinaykpud (Contributor, Author) commented Feb 14, 2025

@mch2 Yes. If the SR node restarts, we should consider loading the available local files instead of starting with an empty directory. In this case, the node restart causes the search replica to become unassigned. When the node comes back up, the allocator attempts to reassign the search replica to the same node. Since the shard was previously assigned to this node, it should already have the necessary files. Therefore, if any local files exist, we should load them.

vinaykpud (Contributor, Author) commented:

Following is the method that executes when allocating a shard to a node:

private void internalRecoverFromStore(IndexShard indexShard) throws IndexShardRecoveryException {

We need to handle two scenarios here.

Assume we created an index with 1 primary (P) and 1 Search Replica (S), which were assigned as follows:

  • Node2 → Primary (P)
  • Node3 → Search Replica (S)

Scenario 1: Node3 leaves the cluster and joins back

If Node3 leaves the cluster and later rejoins, and we attempt to allocate the shard using the EMPTY_STORE strategy, the above method detects existing files in the shard directory. However, due to the EMPTY_STORE strategy, this condition cleans up the directory.

The existing comment in the code states:

// it exists on the directory, but shouldn't exist on the FS, its a leftover (possibly dangling)  
// it's a "new index create" API, we have to do something, so better to clean it than use same data  

I didn’t fully understand the intent behind this, but it seems to be handling a case where there are leftover/dangling files that need cleanup.

Proposed Approach

To handle this case, I suggest modifying this logic to load the files even when the strategy is EMPTY_STORE, instead of deleting them outright.

Additionally, we can simplify the method using the Strategy pattern to replace the existing complex and confusing if-else conditions and make the logic clearer; a rough sketch of what that could look like is below.
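
As a rough illustration of the Strategy idea only: the interface and class names here are invented, and the real strategies would operate on the shard's Store/IndexShard rather than a raw directory path.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the Strategy-pattern refactor, NOT actual OpenSearch code.
interface StoreRecoveryStrategy {
    // Prepare the on-disk Lucene store before the shard is started.
    void prepareStore(Path shardIndexDir) throws IOException;
}

// EMPTY_STORE semantics: always start fresh, removing any leftover files.
class EmptyStoreStrategy implements StoreRecoveryStrategy {
    @Override
    public void prepareStore(Path shardIndexDir) throws IOException {
        if (Files.isDirectory(shardIndexDir)) {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(shardIndexDir)) {
                for (Path file : files) {
                    Files.delete(file); // clean leftover / dangling segment files
                }
            }
        }
    }
}

// EXISTING_STORE semantics: keep whatever segments are already on disk and let the
// caller read the latest commit from them.
class ExistingStoreStrategy implements StoreRecoveryStrategy {
    @Override
    public void prepareStore(Path shardIndexDir) throws IOException {
        boolean hasFiles = false;
        if (Files.isDirectory(shardIndexDir)) {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(shardIndexDir)) {
                hasFiles = files.iterator().hasNext();
            }
        }
        if (!hasFiles) {
            throw new IOException("expected existing shard files in " + shardIndexDir + ", found none");
        }
        // nothing to delete: existing files are reused as-is
    }
}
```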

I’m confident in this change because IndicesStore already ensures that dangling index files are deleted only when all shards of the index are assigned elsewhere. Here is the relevant call:

indicesService.deleteShardStore("no longer used", shardId, currentState);

Since IndicesStore handles cleanup properly, I don’t see any edge cases where we would still need [this logic].

@mch2 Could you confirm if this approach makes sense?

Scenario 2: Node3 leaves permanently, and Node4 is added

In this case, the shard is allocated with the EMPTY_STORE strategy and everything works as expected: the shard is assigned to Node4 with an empty store.

vinaykpud (Contributor, Author) commented Feb 17, 2025

The behavior of Unassigned Shards differs for Primary and Replica shards.

  • If an index has 1 Primary (P) and no replicas, and the node hosting the primary leaves the cluster, the primary shard becomes unassigned. The Cluster Manager will not attempt to allocate it to another node, so the primary stays unassigned.

  • However, for replica shards, if an index has 1 Primary (P) and 1 Replica (R) and the node hosting the replica leaves the cluster, the replica becomes unassigned. The Cluster Manager waits for 1 minute (the delayed allocation timeout, index.unassigned.node_left.delayed_timeout) to see if the same node rejoins; if it does, the replica is allocated back to that node. If it does not, the Cluster Manager tries to allocate the replica to a different node in the cluster.

That's why this logic breaks:

mch2 (Member) commented Feb 25, 2025

I would prefer not to update the meaning of EMPTY_STORE; EMPTY_STORE should mean just that and start fresh. Can we update the recoverySource to be existing store and fall back to empty only if the shard copy does not exist on the node?
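
For illustration only, here is a minimal, self-contained sketch of that fallback. It is not OpenSearch code; the directory-based hasLocalShardCopy() check is a stand-in for however the real recovery path would detect an existing shard copy (e.g. a usable segments_N commit in the shard's index directory).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Standalone sketch of "existing store with fallback to empty", NOT OpenSearch code.
public class SearchReplicaRecoverySourceFallback {

    enum RecoverySourceType { EXISTING_STORE, EMPTY_STORE }

    static boolean hasLocalShardCopy(Path shardIndexDir) throws IOException {
        if (!Files.isDirectory(shardIndexDir)) {
            return false;
        }
        try (DirectoryStream<Path> commits = Files.newDirectoryStream(shardIndexDir, "segments_*")) {
            return commits.iterator().hasNext(); // any Lucene commit file counts as a local copy
        }
    }

    // Default to EXISTING_STORE so a restarted node reuses its local segments, and
    // fall back to EMPTY_STORE only when nothing usable exists on disk.
    static RecoverySourceType resolve(Path shardIndexDir) throws IOException {
        return hasLocalShardCopy(shardIndexDir) ? RecoverySourceType.EXISTING_STORE : RecoverySourceType.EMPTY_STORE;
    }

    public static void main(String[] args) throws IOException {
        Path emptyDir = Files.createTempDirectory("shard-index");
        System.out.println(resolve(emptyDir)); // EMPTY_STORE: no segments_* files in a fresh temp dir
    }
}
```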
