Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: CastExpr Ordering Handle #10650

Merged
merged 10 commits into from
May 27, 2024
Merged

Conversation

berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented May 24, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

A minor fix in the CastExpr ordering. Boolean to numeric cast needs to preserve ordering.

What changes are included in this PR?

Fixed CastExpr ordering for Boolean to numeric cast

Are these changes tested?

with existing tests

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels May 24, 2024
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we lose generality by doing this once at the listing table factory level (instead of the old way of doing it in create_physical_plan). I'd appreciate thoughts from @devinjdangelo and @alamb

.has_header
.unwrap_or(state.config_options().catalog.has_header),
)
.with_header(first_chunk && self.options.has_header)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there is a catalog option on header? This might not be tested in the code since the SQL tests wouldn't cover here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @metegenez for the review :) I have tried to extract the option from the catalog in the most inclusive way possible. This step seems to be when we create a TableProvider.

Another alternative would be to consult the catalog wherever we read the header from a FileFormat. However, doing this everywhere makes the code cluttered (in many places, there isn't even a catalog object available). Therefore, I think that if we check this once during the execution, where FileFormat is first introduced—that is, at the part where TableProvider is created—we won't need to check it again. Are there any use cases for creating and using FileFormat independently of any TableProvider?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @berkaysynnada

In general I think we will be able to merge these PRs faster if you can break them up into multiple PRs that are not related. I realize this takes more work on the author's behalf (aka you) but it results in more efficient reviews because it is easier to see the effects of a particular change (rather than have three separate changes intermixed)

The change to serializing proto encoding and cargo.tom looks good to me. I don't fully understand the rationale or implications to change the config value from Option -- the way it currently is (if not explicitly set fall back to value on ConfigOptions) makes more sense to me

@@ -80,7 +80,7 @@ itertools = { workspace = true }
log = { workspace = true }
md-5 = { version = "^0.10.0", optional = true }
rand = { workspace = true }
regex = { worksapce = true, optional = true }
regex = { workspace = true, optional = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you -- that is a nice fix -- it would be easier to merge as a separate PR

@@ -1106,7 +1106,7 @@ message CsvWriterOptions {

// Options controlling CSV format
message CsvOptions {
bytes has_header = 1; // Indicates if the CSV has a header row
bool has_header = 1; // Indicates if the CSV has a header row
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is certainly nicer

@ozankabak
Copy link
Contributor

ozankabak commented May 25, 2024

Maybe I can help clarify things a little bit. The purpose of the PR is not to change the current fallback behavior, just simplify its implementation. @berkaysynnada is moving the fallback logic to TableProviderFactory's create (from FileFormat's create_physical_plan), which enables him to simplify the rest of the code. I see that he implemented this for ListingTableFactory. The proto change is a consequence of the above -- it can't be made independently AFAICT.

The upside of the approach in this PR is that it simplifies the code across multiple files (e.g. proto), but the fallback behavior no longer becomes automatic for all custom table providers as far as I can tell -- @berkaysynnada please correct me if I'm wrong.

Maybe we can do the following: Instead of making every TableProviderFactory responsible for implementing this behavior in create, we can add some general mechanism at the trait level (maybe via a default method implementation?) to make this fallback automatic for all TableProviderFactory implementations unless overridden. @alamb, what do you think? Any ideas come to mind?

@alamb
Copy link
Contributor

alamb commented May 25, 2024

Maybe we can do the following: Instead of making every TableProviderFactory responsible for implementing this behavior in create, we can add some general mechanism at the trait level (maybe via a default method implementation?) to make this fallback automatic for all TableProviderFactory implementations unless overridden. @alamb, what do you think? Any ideas come to mind?

Thank you for the explanation -- what you describe makes sense -- maybe we can update the documentation to add some of this context and that would be good enough. I haven't thought through all the nuances of general fallbacks, but given what you say perhaps it is a corner case we can worry about if/when somone actually hits it.

From my perspective the most important thing is for the behavior to be clearly documented.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label May 27, 2024
@berkaysynnada
Copy link
Contributor Author

The upside of the approach in this PR is that it simplifies the code across multiple files (e.g. proto), but the fallback behavior no longer becomes automatic for all custom table providers as far as I can tell -- @berkaysynnada please correct me if I'm wrong.

This explanation summarizes it very well. Currently, the extraction of catalog information is encapsulated in the create() method of TableProviderFactory. It seems tidier to have this extraction centralized in one spot (or we could move it to another recommended location if there is any) rather than having it scattered. However, as @ozankabak mentioned, a custom TableProvider should also handle this. I can add an explanation for this in the docstring of the trait method.

@berkaysynnada
Copy link
Contributor Author

After talking with @ozankabak, we decided to undo the changes related to header options.

There's also another place where SessionState is used in the FileFormat API:

Suggesting that people should extract SessionState values during TableProvider creation might not be safe. I'm not sure if removing SessionState from this API is the right move, but it definitely needs more thought. Rushing to clean up the code could lead to losing functionality. We should discuss these in a different issue.

For now, this PR only makes a small fix to CastExpr.

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, let's update the PR title and body prior to merging

@berkaysynnada berkaysynnada changed the title Minor: Csv Options Clean-up Minor: CastExpr Ordering Handle May 27, 2024
@berkaysynnada berkaysynnada force-pushed the feature/minor-header-cast branch 2 times, most recently from d83ddf0 to c47dca6 Compare May 27, 2024 11:37
@berkaysynnada berkaysynnada force-pushed the feature/minor-header-cast branch from c47dca6 to 5e2e023 Compare May 27, 2024 11:41
@github-actions github-actions bot removed the physical-expr Physical Expressions label May 27, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada -- this change for cast looks good to me

I also double checked that casting bool to integer preserves ordering:

> create table t(b boolean, x integer);
0 row(s) fetched.
Elapsed 0.006 seconds.

> insert into t values (true, true::int);
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.

> insert into t values (false, false::int);
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.001 seconds.

> select * from t order by b;
+-------+---+
| b     | x |
+-------+---+
| false | 0 |
| true  | 1 |
+-------+---+
2 row(s) fetched.
Elapsed 0.009 seconds.

> select * from t order by x;
+-------+---+
| b     | x |
+-------+---+
| false | 0 |
| true  | 1 |
+-------+---+
2 row(s) fetched.

@alamb alamb merged commit 2e6beb4 into apache:main May 27, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* header option removed

* Update csv.rs

* Update path_partition.rs

* Update path_partition.rs

* adding test

* adding test

* Revert
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants