-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache common referred expression at the window input #9009
Conversation
# Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @mustafasrepo -- this PR looks good to me. Also, I found the description on this PR very clear and well written. Thank you very much 🙏
One thought I had was will there be a problem if there is a subquery that would end up with a nested WindowAggExec that could be incorrectly optimized away 🤔
Something like
SELECT c3,
SUM(c9) OVER(ORDER BY c3+c4 ASC) as sum2,
sum1,
FROM (
SELECT c3, c4, c9,
SUM(c9) OVER(ORDER BY c3+c4 DESC) as sum1,
FROM aggregate_test_100
)
cc @waynexia and @haohuaijin
let input_schema = Arc::clone(input.schema()); | ||
let arrays = | ||
to_arrays(window_expr, input_schema, &mut expr_set, ExprMask::Normal)?; | ||
// Get all window expressions inside the consecutive window operators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can add a comment here about why this is recursively looking down into all window operations (e.g. because they all get the same input schema and append on some window functions, but the window functions can't refer to previous window functions).
I think perhaps you could reuse the (very nicely written) description from this PR which explains it very well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mustafasrepo and @alamb, look great to me!
I also have the same question as @alamb.
Co-authored-by: Huaijin <[email protected]>
# Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
I think, in these cases, we will generate a sub-optimal plan, where a complex expression is calculated more than once by subsequent operators. However, didn't cached (Previous behaviour). However, I don't think we will generate an invalid plan. I added your example as a test case also in this PR. I think as a future PR, we can analyze plan from top down to count expression referral count, for better calculating referral counts across plan. |
# Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
Thank you Thank you for this contribution @mustafasrepo -- this PR looks good to me. Also, I found the description on this PR very clear and well written. Thank you very much 🙏 One thought I had was will there be a problem if there is a subquery that would end up with a nested WindowAggExec that could be incorrectly optimized away 🤔 Something like SELECT c3,
Yes, I agree there is no need to optimize this case as part of this PR, and since it gives correct results, lets 🚀 |
Which issue does this PR close?
Closes #.
Rationale for this change
The PR8960 retracted 2 of the window tests. This PR fixes these retracted tests also adds a new feature for caching common expressions at the window input.
As an example consider the following query
which will generate following logical plan.
where expression
c3+c4
is computed in the firstProjection
and its result (which is a column) is used in subsequentWindowAggr
. This is done with theCommonSubexprEliminate
rule.However, for the following query
datafusion generates following plan:
where computation
c3+c4
couldn't cache with aProjection
before firstWindowAggr
. The reason is that, eachWindowAggr
refers toc3+c4
once. HenceCommonSubExpr
rule doesn't think "removing it is helpful".What changes are included in this PR?
This PR fixes above problem so that
CommonSubExpr
considers consecutive window operators during common sub expression substitute analysis. With this analysis we can generate following logical plan for the second query:where common computation is cached with
Projection
before first window.Are these changes tested?
Yes
Are there any user-facing changes?