TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367

Closed
Dandandan opened this issue Nov 26, 2021 · 5 comments · Fixed by #1368
Labels
bug Something isn't working datafusion Changes in the datafusion crate performance Make DataFusion faster

Comments

@Dandandan
Contributor

Dandandan commented Nov 26, 2021

Describe the bug
The fastest I get on master is 10s. After #1366 it's around 7s.
This used to be <2s.
Looking at the plan, it looks like some filters are not pushed down successfully (o_orderdate@9 >= 8674 AND o_orderdate@9 < 8766 AND l_returnflag@13 = R)

In #1319 we added an alias to expressions rewritten by constant folding. It might be good to only do this for Projection (at least, not in Filter).

As you can see below, the filter has an AS, which prevents the filter pushdown from working.

[2021-11-26T17:12:17Z DEBUG datafusion::execution::context] Optimized logical plan:
     Sort: #revenue DESC NULLS FIRST
      Projection: #customer.c_custkey, #customer.c_name, #SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS revenue, #customer.c_acctbal, #nation.n_name, #customer.c_address, #customer.c_phone, #customer.c_comment
        Aggregate: groupBy=[[#customer.c_custkey, #customer.c_name, #customer.c_acctbal, #customer.c_phone, #nation.n_name, #customer.c_address, #customer.c_comment]], aggr=[[SUM(#lineitem.l_extendedprice * Int64(1) - #lineitem.l_discount)]]
          Join: #customer.c_nationkey = #nation.n_nationkey
            Filter: #orders.o_orderdate >= Date32("8674") AND #orders.o_orderdate < Date32("8766") AND #lineitem.l_returnflag = Utf8("R") AS orders.o_orderdate >= CAST(Utf8("1993-10-01") AS Date32) AND orders.o_orderdate < CAST(Utf8("1994-01-01") AS Date32) AND lineitem.l_returnflag = Utf8("R")
              Join: #orders.o_orderkey = #lineitem.l_orderkey
                Join: #customer.c_custkey = #orders.o_custkey
                  TableScan: customer projection=Some([0, 1, 2, 3, 4, 5, 7])
                  TableScan: orders projection=Some([0, 1, 4])
                TableScan: lineitem projection=Some([0, 5, 6, 8])
            TableScan: nation projection=Some([0, 1])
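
To illustrate why the alias matters, here is a minimal sketch using a toy expression tree. It is not DataFusion's real `Expr` type or its actual pushdown rule: `Expr`, `Op`, `split_conjuncts`, `unalias`, and `q10_predicate` are hypothetical names made up for this example. It assumes a pushdown pass that splits a predicate on AND and moves each conjunct independently; once the whole predicate is wrapped in an alias, that pass sees a single opaque expression instead of three conjuncts.

```rust
// Toy expression tree for illustration only; NOT DataFusion's real `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Binary { left: Box<Expr>, op: Op, right: Box<Expr> },
    // `expr AS name`, like the alias seen on the Filter in the plan above.
    Alias(Box<Expr>, String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    And,
    GtEq,
    Lt,
    Eq,
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_string())
}

fn binary(left: Expr, op: Op, right: Expr) -> Expr {
    Expr::Binary { left: Box::new(left), op, right: Box::new(right) }
}

/// Split a predicate on AND. Anything that is not an AND (including an
/// alias wrapping an AND) is treated as one indivisible conjunct.
fn split_conjuncts(expr: &Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::Binary { left, op: Op::And, right } => {
            split_conjuncts(left, out);
            split_conjuncts(right, out);
        }
        other => out.push(other.clone()),
    }
}

/// Hypothetical helper: peel off any top-level aliases.
fn unalias(expr: &Expr) -> &Expr {
    match expr {
        Expr::Alias(inner, _) => unalias(inner),
        other => other,
    }
}

/// The q10 filter after constant folding, without the alias.
fn q10_predicate() -> Expr {
    let date_lo = binary(col("orders.o_orderdate"), Op::GtEq, lit("8674"));
    let date_hi = binary(col("orders.o_orderdate"), Op::Lt, lit("8766"));
    let flag = binary(col("lineitem.l_returnflag"), Op::Eq, lit("R"));
    binary(binary(date_lo, Op::And, date_hi), Op::And, flag)
}
```

Splitting `q10_predicate()` directly yields three conjuncts that could each move toward `orders` or `lineitem`; splitting the same predicate wrapped in `Expr::Alias` yields one, which stays above the joins. The issue's own suggestion is to avoid adding the alias in Filter in the first place; `unalias` is shown only to make the contrast visible.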

To Reproduce
Use latest master to run q10 against some parquet data, then compare against an older version (how old is still to be found out).

Expected behavior

Additional context

@Dandandan Dandandan added bug Something isn't working performance Make DataFusion faster labels Nov 26, 2021
@Dandandan
Contributor Author

Dandandan commented Nov 26, 2021

It seems it might be related to #1319 (adding an alias in constant folding prevents the filter from being pushed down)

@Dandandan Dandandan changed the title TPC-H q10 performance regression TPC-H q10 performance regression (expression for filter with added alias is not pushed down) Nov 26, 2021
@alamb
Contributor

alamb commented Nov 27, 2021

I think it would be wise to add a test for TPCH plans to datafusion, so that any changes in the plans are apparent when the changes are made.

@Dandandan
Contributor Author

I think it would be wise to add a test for TPCH plans to datafusion, so that any changes in the plans are apparent when the changes are made.

Yes - fully agreed.
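
A plan test of this kind could be as simple as comparing the formatted optimized plan against a stored expected string. The following is only a rough sketch under that assumption: `plan_lines` and `first_plan_difference` are hypothetical helpers, not part of DataFusion, and how the plan text is obtained (for example from the debug output shown earlier) is left out.

```rust
/// Hypothetical helper: non-empty plan lines with trailing whitespace removed.
/// Leading indentation is kept because it encodes the plan's tree structure.
fn plan_lines(plan: &str) -> Vec<&str> {
    plan.lines()
        .map(str::trim_end)
        .filter(|line| !line.trim().is_empty())
        .collect()
}

/// Hypothetical helper: return the 1-based line number and both versions of
/// the first line where the two plans differ, or `None` if they match.
fn first_plan_difference(expected: &str, actual: &str) -> Option<(usize, String, String)> {
    let exp = plan_lines(expected);
    let act = plan_lines(actual);
    let len = exp.len().max(act.len());
    for i in 0..len {
        // A plan that is shorter than the other reports "<missing>".
        let e = exp.get(i).copied().unwrap_or("<missing>");
        let a = act.get(i).copied().unwrap_or("<missing>");
        if e != a {
            return Some((i + 1, e.to_string(), a.to_string()));
        }
    }
    None
}
```

A test would assert that `first_plan_difference(expected, actual)` is `None`, so a change such as a Filter staying above a Join or gaining an alias shows up as a failing line in the pull request that introduced it.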

@alamb
Contributor

alamb commented Nov 29, 2021

🎉

@houqp
Member

houqp commented Dec 2, 2021

nice catch @Dandandan :D

@alamb alamb added the datafusion Changes in the datafusion crate label Feb 10, 2022
3 participants