
[RW Separation] Search replica recovery flow breaks when search shard allocated to new node after node drop #17334

Open
vinaykpud opened this issue Feb 12, 2025 · 6 comments · May be fixed by #17457
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance

Comments

vinaykpud (Contributor) commented Feb 12, 2025

Is your feature request related to a problem? Please describe

Context:
Created a 5-node cluster and an index with 1 primary (P) and 1 search replica (S).

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles      cluster_manager name
172.18.0.3           26          90   5    2.19    1.84     1.69 d         data            -               opensearch-node5
172.18.0.4           23          90   5    2.19    1.84     1.69 -         coordinating    -               opensearch-node1
172.18.0.2           27          90   6    2.19    1.84     1.69 d         data            -               opensearch-node3
172.18.0.5           23          90   5    2.19    1.84     1.69 d         data            -               opensearch-node4
172.18.0.6           30          90   5    2.19    1.84     1.69 m         cluster_manager *               opensearch-node2

The shard assignment was as follows:

index    shard prirep state   docs store ip         node
products 0     p      STARTED    0  230b 172.18.0.3 opensearch-node5
products 0     s      STARTED    0  230b 172.18.0.2 opensearch-node3

To simulate a node drop (the cluster runs locally in Docker), I stopped node3:

index    shard prirep state      docs store ip         node
products 0     p      STARTED       0  230b 172.18.0.3 opensearch-node5
products 0     s      UNASSIGNED

After 1 minute, the AllocationService tries to allocate the search shard to node4, and the allocation fails with the exception below:

2025-02-09 13:07:53 "stacktrace": ["org.opensearch.indices.recovery.RecoveryFailedException: [products3][0]: Recovery failed on {opensearch-node8}{eHuGysErRFuGUCFO2KxGuw}{Wby_7fTEToWk5bnavKBlbA}{172.18.0.9}{172.18.0.9:9300}{dimr}{zone=zone3, shard_indexing_pressure_enabled=true}",
2025-02-09 13:07:53 "at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$32(IndexShard.java:3902) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$10(StoreRecovery.java:618) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:347) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:123) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2919) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]",
2025-02-09 13:07:53 "at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]",
2025-02-09 13:07:53 "at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]",
2025-02-09 13:07:53 "Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed to fetch index version after copying it over",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:717) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more",
2025-02-09 13:07:53 "Caused by: org.opensearch.index.shard.IndexShardRecoveryException: shard allocated for local recovery (post api), should exist, but doesn't, current files: []",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:702) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more",
2025-02-09 13:07:53 "Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/usr/share/opensearch/data/nodes/0/indices/PFeBY9eRTaKcRhDKEM2WAQ/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@52539624)): files: []",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:808) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:764) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:542) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:526) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]",
2025-02-09 13:07:53 "at org.opensearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:135) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.store.Store.readSegmentsInfo(Store.java:255) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:237) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:692) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
2025-02-09 13:07:53 "... 8 more"] }

Describe the solution you'd like

This happens because of how the recovery source is set in ShardRouting:

When moveToUnassigned is called, we set the recoverySource to ExistingStoreRecoverySource for the search replica. Since this scenario involves recovering the shard on a different node, that node has no files in its local store to recover from, and recovery fails with the exception above. The proposed solution is to always use EmptyStoreRecoverySource for search replicas (see the sketch below).
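
For clarity, here is a minimal, self-contained model of the rule being proposed. It is not the actual ShardRouting/RecoverySource code; the class, enum, and method names are made up for illustration, and the PRIMARY/REPLICA branches simply mirror the usual existing-store/peer behavior.

```java
// Toy model of the proposed rule only -- NOT the actual OpenSearch ShardRouting /
// RecoverySource code. Names here are illustrative.
public class RecoverySourceChoice {

    enum ShardKind { PRIMARY, REPLICA, SEARCH_REPLICA }

    enum RecoverySourceType { EXISTING_STORE, PEER, EMPTY_STORE }

    // Proposed rule from this issue: when a shard moves to unassigned, a search
    // replica should not be pinned to EXISTING_STORE, because it may later be
    // allocated to a node that has no local copy of the shard files.
    static RecoverySourceType onMoveToUnassigned(ShardKind kind) {
        switch (kind) {
            case PRIMARY:
                return RecoverySourceType.EXISTING_STORE; // primary reuses its own local copy
            case SEARCH_REPLICA:
                return RecoverySourceType.EMPTY_STORE;    // proposed fix: start from an empty store
            case REPLICA:
            default:
                return RecoverySourceType.PEER;           // regular replica recovers from the primary
        }
    }

    public static void main(String[] args) {
        System.out.println(onMoveToUnassigned(ShardKind.SEARCH_REPLICA)); // EMPTY_STORE
        System.out.println(onMoveToUnassigned(ShardKind.PRIMARY));        // EXISTING_STORE
    }
}
```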

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

mch2 (Member) commented Feb 13, 2025

@vinaykpud This will impact cases where a search replica (SR) node is restarted. Can you check our recovery logic to see if that's the case? I.e., we still want to diff local segments and ensure we only fetch what's required to recover.

vinaykpud (Contributor, Author) commented:

Added an integ test reproducing this: d89c1cf

vinaykpud (Contributor, Author) commented Feb 14, 2025

@mch2 Yes. If the SR node restarts, we should consider loading the available local files instead of starting with an empty directory. In this case, the node restart causes the search replica to become unassigned. When the node comes back up, the allocator attempts to reassign the search replica to the same node. Since the shard was previously assigned to this node, it should already have the necessary files. Therefore, if any local files exist, we should load them.

vinaykpud (Contributor, Author) commented:

Following is the method that executes when allocating a shard to a node:

private void internalRecoverFromStore(IndexShard indexShard) throws IndexShardRecoveryException {

We need to handle two scenarios here.

Assume we created an index with 1 primary (P) and 1 Search Replica (S), which were assigned as follows:

  • Node2 → Primary (P)
  • Node3 → Search Replica (S)

Scenario 1: Node3 leaves the cluster and joins back

If Node3 leaves the cluster and later rejoins, and we attempt to allocate the shard using the EMPTY_STORE strategy, the above method detects existing files in the shard directory. However, due to the EMPTY_STORE strategy, this condition cleans up the directory.

The existing comment in the code states:

// it exists on the directory, but shouldn't exist on the FS, its a leftover (possibly dangling)  
// it's a "new index create" API, we have to do something, so better to clean it than use same data  

I didn’t fully understand the intent behind this, but it seems to be handling a case where there are leftover/dangling files that need cleanup.

Proposed Approach

To handle this case, I suggest modifying this logic to load the files even when the strategy is EMPTY_STORE, instead of deleting them outright.

Additionally, we can simplify the method using the Strategy pattern to replace the existing complex and confusing if-else conditions and make the logic clearer; a rough sketch of what that could look like is below.
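
As a rough illustration of the Strategy idea only: the interface and class names here are invented, and the real strategies would operate on the shard's Store/IndexShard rather than a raw directory path.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the Strategy-pattern refactor, NOT actual OpenSearch code.
interface StoreRecoveryStrategy {
    // Prepare the on-disk Lucene store before the shard is started.
    void prepareStore(Path shardIndexDir) throws IOException;
}

// EMPTY_STORE semantics: always start fresh, removing any leftover files.
class EmptyStoreStrategy implements StoreRecoveryStrategy {
    @Override
    public void prepareStore(Path shardIndexDir) throws IOException {
        if (Files.isDirectory(shardIndexDir)) {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(shardIndexDir)) {
                for (Path file : files) {
                    Files.delete(file); // clean leftover / dangling segment files
                }
            }
        }
    }
}

// EXISTING_STORE semantics: keep whatever segments are already on disk and let the
// caller read the latest commit from them.
class ExistingStoreStrategy implements StoreRecoveryStrategy {
    @Override
    public void prepareStore(Path shardIndexDir) throws IOException {
        boolean hasFiles = false;
        if (Files.isDirectory(shardIndexDir)) {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(shardIndexDir)) {
                hasFiles = files.iterator().hasNext();
            }
        }
        if (!hasFiles) {
            throw new IOException("expected existing shard files in " + shardIndexDir + ", found none");
        }
        // nothing to delete: existing files are reused as-is
    }
}
```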

I’m confident in this change because IndicesStore already ensures that dangling index files are deleted only when all shards of the index are assigned elsewhere. Here is the relevant call:

indicesService.deleteShardStore("no longer used", shardId, currentState);

Since IndicesStore handles cleanup properly, I don’t see any edge cases where we would still need [this logic].

@mch2 Could you confirm if this approach makes sense?

Scenario 2: Node3 leaves permanently, and Node4 is added

In this case, the shard is allocated with the EMPTY_STORE strategy and everything works as expected: the shard is assigned to Node4 with an empty store.

vinaykpud (Contributor, Author) commented Feb 17, 2025

The behavior of Unassigned Shards differs for Primary and Replica shards.

  • If an index has 1 Primary (P) and no replicas, and the node hosting the primary leaves the cluster, the primary shard becomes unassigned. The Cluster Manager will not attempt to allocate it to another node, so the primary stays unassigned.

  • However, for replica shards, if an index has 1 Primary (P) and 1 Replica (R) and the node hosting the replica leaves the cluster, the replica becomes unassigned. The Cluster Manager waits for 1 minute (the delayed allocation timeout, index.unassigned.node_left.delayed_timeout) to see if the same node rejoins; if it does, the replica is allocated back to that node. If it does not, the Cluster Manager tries to allocate the replica to a different node in the cluster.

That's why this logic breaks:

mch2 (Member) commented Feb 25, 2025

I would prefer not to update the meaning of EMPTY_STORE; EMPTY_STORE should mean just that and start fresh. Can we update the recoverySource to be existing store and fall back to empty only if the shard copy does not exist on the node?
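
For illustration only, here is a minimal, self-contained sketch of that fallback. It is not OpenSearch code; the directory-based hasLocalShardCopy() check is a stand-in for however the real recovery path would detect an existing shard copy (e.g. a usable segments_N commit in the shard's index directory).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Standalone sketch of "existing store with fallback to empty", NOT OpenSearch code.
public class SearchReplicaRecoverySourceFallback {

    enum RecoverySourceType { EXISTING_STORE, EMPTY_STORE }

    static boolean hasLocalShardCopy(Path shardIndexDir) throws IOException {
        if (!Files.isDirectory(shardIndexDir)) {
            return false;
        }
        try (DirectoryStream<Path> commits = Files.newDirectoryStream(shardIndexDir, "segments_*")) {
            return commits.iterator().hasNext(); // any Lucene commit file counts as a local copy
        }
    }

    // Default to EXISTING_STORE so a restarted node reuses its local segments, and
    // fall back to EMPTY_STORE only when nothing usable exists on disk.
    static RecoverySourceType resolve(Path shardIndexDir) throws IOException {
        return hasLocalShardCopy(shardIndexDir) ? RecoverySourceType.EXISTING_STORE : RecoverySourceType.EMPTY_STORE;
    }

    public static void main(String[] args) throws IOException {
        Path emptyDir = Files.createTempDirectory("shard-index");
        System.out.println(resolve(emptyDir)); // EMPTY_STORE: no segments_* files in a fresh temp dir
    }
}
```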
