fix explain action on query rewrite #17286

fen-qin · 2025-02-06T21:31:37Z

Description

There is an exception when call explain in "by_doc_id" mode. Response looks like this:

{
    "error": {
        "root_cause": [
            {
                "type": "query_shard_exception",
                "reason": "failed to create query: async actions are left after rewrite",
                "index": "my-nlp-index-1",
                "index_uuid": "I6_pzGG3QBWw0B6KeOJ1pA"
            }
        ],
        "type": "query_shard_exception",
        "reason": "failed to create query: async actions are left after rewrite",
        "index": "my-nlp-index-1",
        "index_uuid": "I6_pzGG3QBWw0B6KeOJ1pA",
        "caused_by": {
            "type": "illegal_state_exception",
            "reason": "async actions are left after rewrite"
        }
    },
    "status": 400
}

The error were caught from TransportExplainAction -> Rewriteable.java because there still be asyncActions left when the flag assertNoAsyncTasks returned as true.

The conflict is down to:

The NeuralQueryBuilder or TermsQueryBuilder doRewrite method introduces asynchronous actions via queryRewriteContext.registerAsyncAction. So it then returns a context with potentially unresolved async tasks.
Rewritable.rewrite then checksassertNoAsyncTasks flag, throws the exception if there exists any async tasks left.

Fix in this PR is resolve this conflict:

by moving the query rewrite to coordinator - TransportExplainAction.
verify the changes from local using TermsQueryBuilder
- Before fix
- After fix

Related Issues

Resolves - opensearch-project/neural-search#1126

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2025-02-06T22:28:20Z

❌ Gradle check result for 3b159cd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

dbwiddis

This looks like a re-submittal of #17277 which has unresolved comments. Please reference earlier PRs when re-submitting them so that reviewers have the full context of previous comments (and can make sure they are resolved).

Code looks like it addresses the problem, but it's overly complex chaining an action listener with a null that gets ignored. Suggest moving the null check up to the top of the method and calling super.doExecute() (basically the existing code, but wrapped in a null check with a shortcut return before creating more objects). Only if the query is non-null do you need to instantiate the actionListener and chain that code.

Also looks like some checks are failing like DCO and Change log.

dbwiddis · 2025-02-10T15:49:12Z

server/src/main/java/org/opensearch/action/explain/TransportExplainAction.java

+        if (request.query() == null) {
+            rewriteListener.onResponse(request.query());


This completes the rewriteListener with null. I'm not sure that's what you intend. Following the code path it looks like it just eventually calls super.doExecute(task, request, listener); so perhaps it's better to put that line here rather than chaining the listener.

That would allow you to simplify your code by just putting the wrapped ActionListener in the rewriteAndFetch() branch.

The coordinator node must rewrite the request before forwarding it to any shard. Rewriting on shards is prohibited when there are pending asynchronous operations.

Rewritable.rewriteAndFetch() to perform query rewriting to register and execute the async actions from queryRewriteContext.registerAsyncAction

After rewriting the query, the code must pass the rewritten query (not the original query) to super.doExecute(). This ensures the updated request containing the rewritten query is used during execution.

I found there is another place in core that handle this rewritable issue before

00b21f8#diff-9f08370a1694770e86f7e0fbb6ce470f7f3f67d340f42a70d184f19b8b77c1b3R78

I found there is another place in core that handle this rewritable issue before

I see, but I'd make the same comments about that code. And your code has now diverged from that in that you're using an assert for non-null, which won't actually run in production; you'd just pass the null query to rewriteAndFetch().

OpenSearch/server/src/main/java/org/opensearch/index/query/Rewriteable.java

Lines 111 to 119 in e397903

static <T extends Rewriteable<T>> void rewriteAndFetch(

T original,

QueryRewriteContext context,

ActionListener<T> rewriteResponse,

int iteration

) {

T builder = original;

try {

for (T rewrittenBuilder = builder.rewrite(context); rewrittenBuilder != builder; rewrittenBuilder = builder.rewrite(context)) {

You'd be passing that null on line 112 and getting an NPE on line 119. So you need to put the null check back in closer to what you had (or what the linked code does).

The point I'm making both with your original code and the other example you linked is that this:

if (request.query() == null) { rewriteListener.onResponse(request.query()); }

is really

if (request.query() == null) { rewriteListener.onResponse(null); }

And if you trace that into the rewrite listener you're passing a null rewrittenQuery into this block:

rewrittenQuery -> { request.query(rewrittenQuery); super.doExecute(task, request, listener); }

This sets request.query(null) (but already was null!) and then calls super.doExecute() with that "updated" request (which is really the same request with a null query).

So my suggestion (and I'd suggest the same on the linked code but that's out of scope) is to move the null check and super.doExecute() call to the very top of the method, before spending any time creating new objects.

protected void doExecute(Task task, ExplainRequest request, ActionListener<ExplainResponse> listener) { request.nowInMillis = System.currentTimeMillis(); // if there's no query we can't rewrite it if (request.query() == null) { super.doExecute(task, request, listener); return; } // Rewrite the query if present ActionListener<QueryBuilder> rewriteListener = ActionListener.wrap(...<existing code>...); Rewriteable.rewriteAndFetch( request.query(), searchService.getIndicesService().getRewriteContext(() -> request.nowInMillis), rewriteListener ); }

thanks a lot, Dan for the detailed explanation. previously, I thought all request would need a rewrite. now it makes sense. raised another revision accordingly. Appreciate for your time!

github-actions · 2025-02-12T22:29:30Z

❌ Gradle check result for 2dcc19b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-02-13T21:08:24Z

❌ Gradle check result for a111f2f:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-02-21T01:28:23Z

❌ Gradle check result for 9e7e026: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-02-21T23:09:26Z

✅ Gradle check result for 0389c20: SUCCESS

codecov · 2025-02-21T23:10:19Z

Codecov Report

Attention: Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

Project coverage is 72.46%. Comparing base (e397903) to head (a5778cb).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
...nsearch/action/explain/TransportExplainAction.java	0.00%	10 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #17286      +/-   ##
============================================
- Coverage     72.47%   72.46%   -0.01%     
- Complexity    65693    65697       +4     
============================================
  Files          5304     5304              
  Lines        304981   304990       +9     
  Branches      44238    44239       +1     
============================================
- Hits         221033   221023      -10     
+ Misses        65808    65792      -16     
- Partials      18140    18175      +35

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Fen Qin <[email protected]>

github-actions · 2025-02-25T21:50:52Z

✅ Gradle check result for ceadf37: SUCCESS

Signed-off-by: Fen Qin <[email protected]>

github-actions · 2025-02-26T01:59:57Z

❕ Gradle check result for a5778cb: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

dbwiddis

LGTM with some style nits that I won't hold up approval on.

dbwiddis · 2025-02-28T07:45:48Z

server/src/internalClusterTest/java/org/opensearch/explain/ExplainActionIT.java

+        client().admin()
+            .indices()
+            .prepareCreate("twitter")
+            .setMapping("user", "type=integer", "followers", "type=integer")


(Style Nit) It's not clear to the casual observer that this is a set of two key/value pairs. While this is valid, there are other options to use a Map for setting these keys. This is a test class so it's really not a big deal.

dbwiddis · 2025-02-28T07:46:13Z

server/src/internalClusterTest/java/org/opensearch/explain/ExplainActionIT.java

+            .indices()
+            .prepareCreate("twitter")
+            .setMapping("user", "type=integer", "followers", "type=integer")
+            .setSettings(Settings.builder().put(SETTING_NUMBER_OF_SHARDS, 2).put("index.number_of_routing_shards", 2))


(Style Nit) It's strange to see one setting as a constant and the other one not. I think you can use INDEX_NUMBER_OF_ROUTING_SHARDS_SETTING here.

github-actions bot added the enhancement Enhancement or improvement to existing feature or request label Feb 6, 2025

opensearch-ci-bot mentioned this pull request Feb 7, 2025

[AUTOCUT] Gradle Check Flaky Test Report for IndexServiceTests #14407

Open

dbwiddis reviewed Feb 10, 2025

View reviewed changes

fen-qin force-pushed the explain-fix branch from 3b159cd to 2dcc19b Compare February 12, 2025 21:33

opensearch-ci-bot mentioned this pull request Feb 13, 2025

[AUTOCUT] Gradle Check Flaky Test Report for RemotePrimaryLocalRecoveryIT #14314

Open

fen-qin force-pushed the explain-fix branch from 2dcc19b to a111f2f Compare February 13, 2025 20:34

opensearch-ci-bot mentioned this pull request Feb 14, 2025

[AUTOCUT] Gradle Check Flaky Test Report for RemoteRestoreSnapshotIT #14324

Open

fen-qin force-pushed the explain-fix branch from a111f2f to 9e7e026 Compare February 21, 2025 00:34

fen-qin force-pushed the explain-fix branch from 9e7e026 to 0389c20 Compare February 21, 2025 22:16

opensearch-ci-bot mentioned this pull request Feb 25, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SpecificClusterManagerNodesIT #15944

Open

fen-qin added a commit to fen-qin/OpenSearch that referenced this pull request Feb 25, 2025

Fix explain action on query rewrite (opensearch-project#17286)

ceadf37

Signed-off-by: Fen Qin <[email protected]>

fen-qin force-pushed the explain-fix branch from 0389c20 to ceadf37 Compare February 25, 2025 20:54

Fix explain action on query rewrite (opensearch-project#17286)

a5778cb

Signed-off-by: Fen Qin <[email protected]>

fen-qin force-pushed the explain-fix branch from ceadf37 to a5778cb Compare February 26, 2025 01:05

opensearch-ci-bot mentioned this pull request Feb 26, 2025

[AUTOCUT] Gradle Check Flaky Test Report for DedicatedClusterSnapshotRestoreIT #15806

Open

dbwiddis approved these changes Feb 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix explain action on query rewrite #17286

fix explain action on query rewrite #17286

fen-qin commented Feb 6, 2025 •

edited

Loading

github-actions bot commented Feb 6, 2025

dbwiddis left a comment •

edited

Loading

dbwiddis Feb 10, 2025

fen-qin Feb 21, 2025

dbwiddis Feb 26, 2025

fen-qin Feb 26, 2025

github-actions bot commented Feb 12, 2025

github-actions bot commented Feb 13, 2025

github-actions bot commented Feb 21, 2025

github-actions bot commented Feb 21, 2025

codecov bot commented Feb 21, 2025 •

edited

Loading

github-actions bot commented Feb 25, 2025

github-actions bot commented Feb 26, 2025

dbwiddis left a comment

dbwiddis Feb 28, 2025

dbwiddis Feb 28, 2025

		if (request.query() == null) {
		rewriteListener.onResponse(request.query());

	static <T extends Rewriteable<T>> void rewriteAndFetch(
	T original,
	QueryRewriteContext context,
	ActionListener<T> rewriteResponse,
	int iteration
	) {
	T builder = original;
	try {
	for (T rewrittenBuilder = builder.rewrite(context); rewrittenBuilder != builder; rewrittenBuilder = builder.rewrite(context)) {

fix explain action on query rewrite #17286

Are you sure you want to change the base?

fix explain action on query rewrite #17286

Conversation

fen-qin commented Feb 6, 2025 • edited Loading

Description

Related Issues

Check List

github-actions bot commented Feb 6, 2025

dbwiddis left a comment • edited Loading

Choose a reason for hiding this comment

dbwiddis Feb 10, 2025

Choose a reason for hiding this comment

fen-qin Feb 21, 2025

Choose a reason for hiding this comment

dbwiddis Feb 26, 2025

Choose a reason for hiding this comment

fen-qin Feb 26, 2025

Choose a reason for hiding this comment

github-actions bot commented Feb 12, 2025

github-actions bot commented Feb 13, 2025

github-actions bot commented Feb 21, 2025

github-actions bot commented Feb 21, 2025

codecov bot commented Feb 21, 2025 • edited Loading

Codecov Report

github-actions bot commented Feb 25, 2025

github-actions bot commented Feb 26, 2025

dbwiddis left a comment

Choose a reason for hiding this comment

dbwiddis Feb 28, 2025

Choose a reason for hiding this comment

dbwiddis Feb 28, 2025

Choose a reason for hiding this comment

fen-qin commented Feb 6, 2025 •

edited

Loading

dbwiddis left a comment •

edited

Loading

codecov bot commented Feb 21, 2025 •

edited

Loading