-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor: CastExpr Ordering Handle #10650
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we lose generality by doing this once at the listing table factory level (instead of the old way of doing it in create_physical_plan
). I'd appreciate thoughts from @devinjdangelo and @alamb
.has_header | ||
.unwrap_or(state.config_options().catalog.has_header), | ||
) | ||
.with_header(first_chunk && self.options.has_header) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if there is a catalog option on header? This might not be tested in the code since the SQL tests wouldn't cover here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @metegenez for the review :) I have tried to extract the option from the catalog in the most inclusive way possible. This step seems to be when we create a TableProvider
.
Another alternative would be to consult the catalog wherever we read the header from a FileFormat
. However, doing this everywhere makes the code cluttered (in many places, there isn't even a catalog object available). Therefore, I think that if we check this once during the execution, where FileFormat
is first introduced—that is, at the part where TableProvider
is created—we won't need to check it again. Are there any use cases for creating and using FileFormat
independently of any TableProvider
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @berkaysynnada
In general I think we will be able to merge these PRs faster if you can break them up into multiple PRs that are not related. I realize this takes more work on the author's behalf (aka you) but it results in more efficient reviews because it is easier to see the effects of a particular change (rather than have three separate changes intermixed)
The change to serializing proto encoding and cargo.tom looks good to me. I don't fully understand the rationale or implications to change the config value from Option
-- the way it currently is (if not explicitly set fall back to value on ConfigOptions) makes more sense to me
datafusion/functions/Cargo.toml
Outdated
@@ -80,7 +80,7 @@ itertools = { workspace = true } | |||
log = { workspace = true } | |||
md-5 = { version = "^0.10.0", optional = true } | |||
rand = { workspace = true } | |||
regex = { worksapce = true, optional = true } | |||
regex = { workspace = true, optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you -- that is a nice fix -- it would be easier to merge as a separate PR
@@ -1106,7 +1106,7 @@ message CsvWriterOptions { | |||
|
|||
// Options controlling CSV format | |||
message CsvOptions { | |||
bytes has_header = 1; // Indicates if the CSV has a header row | |||
bool has_header = 1; // Indicates if the CSV has a header row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is certainly nicer
Maybe I can help clarify things a little bit. The purpose of the PR is not to change the current fallback behavior, just simplify its implementation. @berkaysynnada is moving the fallback logic to The upside of the approach in this PR is that it simplifies the code across multiple files (e.g. proto), but the fallback behavior no longer becomes automatic for all custom table providers as far as I can tell -- @berkaysynnada please correct me if I'm wrong. Maybe we can do the following: Instead of making every |
Thank you for the explanation -- what you describe makes sense -- maybe we can update the documentation to add some of this context and that would be good enough. I haven't thought through all the nuances of general fallbacks, but given what you say perhaps it is a corner case we can worry about if/when somone actually hits it. From my perspective the most important thing is for the behavior to be clearly documented. |
This explanation summarizes it very well. Currently, the extraction of catalog information is encapsulated in the |
After talking with @ozankabak, we decided to undo the changes related to header options. There's also another place where
SessionState values during TableProvider creation might not be safe. I'm not sure if removing SessionState from this API is the right move, but it definitely needs more thought. Rushing to clean up the code could lead to losing functionality. We should discuss these in a different issue.
For now, this PR only makes a small fix to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, let's update the PR title and body prior to merging
d83ddf0
to
c47dca6
Compare
c47dca6
to
5e2e023
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaysynnada -- this change for cast looks good to me
I also double checked that casting bool to integer preserves ordering:
> create table t(b boolean, x integer);
0 row(s) fetched.
Elapsed 0.006 seconds.
> insert into t values (true, true::int);
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> insert into t values (false, false::int);
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select * from t order by b;
+-------+---+
| b | x |
+-------+---+
| false | 0 |
| true | 1 |
+-------+---+
2 row(s) fetched.
Elapsed 0.009 seconds.
> select * from t order by x;
+-------+---+
| b | x |
+-------+---+
| false | 0 |
| true | 1 |
+-------+---+
2 row(s) fetched.
* header option removed * Update csv.rs * Update path_partition.rs * Update path_partition.rs * adding test * adding test * Revert
Which issue does this PR close?
Closes #.
Rationale for this change
A minor fix in the
CastExpr
ordering. Boolean to numeric cast needs to preserve ordering.What changes are included in this PR?
Fixed
CastExpr
ordering for Boolean to numeric castAre these changes tested?
with existing tests
Are there any user-facing changes?