Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix explain action on query rewrite #17286

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fen-qin
Copy link

@fen-qin fen-qin commented Feb 6, 2025

Description

There is an exception when call explain in "by_doc_id" mode. Response looks like this:

{
    "error": {
        "root_cause": [
            {
                "type": "query_shard_exception",
                "reason": "failed to create query: async actions are left after rewrite",
                "index": "my-nlp-index-1",
                "index_uuid": "I6_pzGG3QBWw0B6KeOJ1pA"
            }
        ],
        "type": "query_shard_exception",
        "reason": "failed to create query: async actions are left after rewrite",
        "index": "my-nlp-index-1",
        "index_uuid": "I6_pzGG3QBWw0B6KeOJ1pA",
        "caused_by": {
            "type": "illegal_state_exception",
            "reason": "async actions are left after rewrite"
        }
    },
    "status": 400
}

The error were caught from TransportExplainAction -> Rewriteable.java because there still be asyncActions left when the flag assertNoAsyncTasks returned as true.

Screenshot 2025-01-29 at 8 29 43 PM

The conflict is down to:

  1. The NeuralQueryBuilder or TermsQueryBuilder doRewrite method introduces asynchronous actions via queryRewriteContext.registerAsyncAction. So it then returns a context with potentially unresolved async tasks.

  2. Rewritable.rewrite then checksassertNoAsyncTasks flag, throws the exception if there exists any async tasks left.

Fix in this PR is resolve this conflict:

  • by moving the query rewrite to coordinator - TransportExplainAction.
  • verify the changes from local using TermsQueryBuilder
    • Before fix
    • Screenshot 2025-01-30 at 9 30 29 PM
    • After fix
    • Screenshot 2025-01-30 at 9 24 10 PM

Related Issues

Resolves - opensearch-project/neural-search#1126

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented Feb 6, 2025

❌ Gradle check result for 3b159cd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Member

@dbwiddis dbwiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a re-submittal of #17277 which has unresolved comments. Please reference earlier PRs when re-submitting them so that reviewers have the full context of previous comments (and can make sure they are resolved).

Code looks like it addresses the problem, but it's overly complex chaining an action listener with a null that gets ignored. Suggest moving the null check up to the top of the method and calling super.doExecute() (basically the existing code, but wrapped in a null check with a shortcut return before creating more objects). Only if the query is non-null do you need to instantiate the actionListener and chain that code.

Also looks like some checks are failing like DCO and Change log.

Comment on lines 112 to 113
if (request.query() == null) {
rewriteListener.onResponse(request.query());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This completes the rewriteListener with null. I'm not sure that's what you intend. Following the code path it looks like it just eventually calls super.doExecute(task, request, listener); so perhaps it's better to put that line here rather than chaining the listener.

That would allow you to simplify your code by just putting the wrapped ActionListener in the rewriteAndFetch() branch.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The coordinator node must rewrite the request before forwarding it to any shard. Rewriting on shards is prohibited when there are pending asynchronous operations.

  • Rewritable.rewriteAndFetch() to perform query rewriting to register and execute the async actions from queryRewriteContext.registerAsyncAction
  • After rewriting the query, the code must pass the rewritten query (not the original query) to super.doExecute(). This ensures the updated request containing the rewritten query is used during execution.

I found there is another place in core that handle this rewritable issue before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found there is another place in core that handle this rewritable issue before

I see, but I'd make the same comments about that code. And your code has now diverged from that in that you're using an assert for non-null, which won't actually run in production; you'd just pass the null query to rewriteAndFetch().

static <T extends Rewriteable<T>> void rewriteAndFetch(
T original,
QueryRewriteContext context,
ActionListener<T> rewriteResponse,
int iteration
) {
T builder = original;
try {
for (T rewrittenBuilder = builder.rewrite(context); rewrittenBuilder != builder; rewrittenBuilder = builder.rewrite(context)) {

You'd be passing that null on line 112 and getting an NPE on line 119. So you need to put the null check back in closer to what you had (or what the linked code does).

The point I'm making both with your original code and the other example you linked is that this:

        if (request.query() == null) {
            rewriteListener.onResponse(request.query());
        } 

is really

        if (request.query() == null) {
            rewriteListener.onResponse(null);
        } 

And if you trace that into the rewrite listener you're passing a null rewrittenQuery into this block:

        rewrittenQuery -> {
            request.query(rewrittenQuery);
            super.doExecute(task, request, listener);
        }

This sets request.query(null) (but already was null!) and then calls super.doExecute() with that "updated" request (which is really the same request with a null query).

So my suggestion (and I'd suggest the same on the linked code but that's out of scope) is to move the null check and super.doExecute() call to the very top of the method, before spending any time creating new objects.

    protected void doExecute(Task task, ExplainRequest request, ActionListener<ExplainResponse> listener) {
        request.nowInMillis = System.currentTimeMillis();
        // if there's no query we can't rewrite it
        if (request.query() == null) {
            super.doExecute(task, request, listener);
            return;
        }
        // Rewrite the query if present
        ActionListener<QueryBuilder> rewriteListener = ActionListener.wrap(...<existing code>...);
        Rewriteable.rewriteAndFetch(
            request.query(),
            searchService.getIndicesService().getRewriteContext(() -> request.nowInMillis),
            rewriteListener
        );
    }    

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot, Dan for the detailed explanation. previously, I thought all request would need a rewrite. now it makes sense. raised another revision accordingly. Appreciate for your time!

Copy link
Contributor

❌ Gradle check result for 2dcc19b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for a111f2f:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 9e7e026: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for 0389c20: SUCCESS

Copy link

codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

Project coverage is 72.46%. Comparing base (e397903) to head (a5778cb).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...nsearch/action/explain/TransportExplainAction.java 0.00% 10 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17286      +/-   ##
============================================
- Coverage     72.47%   72.46%   -0.01%     
- Complexity    65693    65697       +4     
============================================
  Files          5304     5304              
  Lines        304981   304990       +9     
  Branches      44238    44239       +1     
============================================
- Hits         221033   221023      -10     
+ Misses        65808    65792      -16     
- Partials      18140    18175      +35     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

✅ Gradle check result for ceadf37: SUCCESS

Copy link
Contributor

❕ Gradle check result for a5778cb: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link
Member

@dbwiddis dbwiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some style nits that I won't hold up approval on.

client().admin()
.indices()
.prepareCreate("twitter")
.setMapping("user", "type=integer", "followers", "type=integer")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Style Nit) It's not clear to the casual observer that this is a set of two key/value pairs. While this is valid, there are other options to use a Map for setting these keys. This is a test class so it's really not a big deal.

.indices()
.prepareCreate("twitter")
.setMapping("user", "type=integer", "followers", "type=integer")
.setSettings(Settings.builder().put(SETTING_NUMBER_OF_SHARDS, 2).put("index.number_of_routing_shards", 2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Style Nit) It's strange to see one setting as a constant and the other one not. I think you can use INDEX_NUMBER_OF_ROUTING_SHARDS_SETTING here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants