[SPARK-50815][PYTHON][SQL] Fix bug where passing null Variants in createDataFrame causes it to fail and add Variant support in createDataFrame in Spark Connect #49487
Conversation
@HyukjinKwon @cloud-fan Can you look at this?
@harshmotw-db Thanks for the fix!
Thanks for this fix!
LGTM
@gene-db I have now added support for variants in createDataFrame in Spark Connect as well. Can you review again since it modifies one of the code paths that you worked on?
@harshmotw-db Thanks! I left a few questions.
@@ -333,6 +340,7 @@ def convert(data: Sequence[Any], schema: StructType) -> "pa.Table":
LocalDataToArrowConversion._create_converter(
    field.dataType,
    field.nullable,
    variants_as_dicts=True
How do we know when to set this to true or false? It is not clear to me.
This is mostly a hack: the data produced by these converters is fed almost directly into a PyArrow API that creates a PyArrow table later in the method. That API does not know how to deal with VariantVal, and since PyArrow is a third-party library, we cannot change it.
The Arrow schema is a struct with metadata stating that it is a Variant. So, we produce the data as a dict, which the PyArrow API converts into Arrow structs.
I have set it to true only in this specific part of the codebase so that createDataFrame works. I am thinking of cleaner ways of doing this; if I find one, I could merge it as a follow-up.
Ideally, Arrow should have its own Variant type (which could be defined using Arrow extension types). There has been some discussion about it.
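The workaround described above can be sketched roughly as follows. This is not the actual PySpark converter; `FakeVariantVal` is a hypothetical stand-in for `pyspark.sql.types.VariantVal`, assuming only the `value`/`metadata` attributes that appear in the diff:

```python
# Sketch (not the real PySpark code) of the variants_as_dicts workaround:
# PyArrow cannot ingest VariantVal objects, but it can build an Arrow
# struct column from dicts with "value" and "metadata" byte fields.

class FakeVariantVal:
    """Hypothetical stand-in for pyspark.sql.types.VariantVal."""

    def __init__(self, value: bytes, metadata: bytes) -> None:
        self.value = value
        self.metadata = metadata


def convert_variant(value, variants_as_dicts: bool):
    """Simplified mirror of the converter branch shown in the diff."""
    if isinstance(value, dict) and {"value", "metadata"} <= value.keys():
        # Arrow-struct-shaped dict -> VariantVal (the pre-existing direction).
        return FakeVariantVal(value["value"], value["metadata"])
    elif isinstance(value, FakeVariantVal) and variants_as_dicts:
        # VariantVal -> plain dict, so a later pa.Table.from_arrays call can
        # treat the column as an ordinary Arrow struct.
        return {"value": value.value, "metadata": value.metadata}
    raise TypeError("unsupported variant representation")


v = FakeVariantVal(bytearray([12, 1]), bytearray([1, 0, 0]))
print(convert_variant(v, variants_as_dicts=True))
```

The dict produced here round-trips: feeding it back through the converter with `variants_as_dicts=False` reconstructs the variant object.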
):
    return VariantVal(value["value"], value["metadata"])
elif isinstance(value, VariantVal) and variants_as_dicts:
Isn't there a matrix of inputs we could get?

- `value` is `VariantVal` & `variants_as_dicts` is `False`: not handled?
- `value` is `VariantVal` & `variants_as_dicts` is `True`: handled here, returns `dict`
- `value` is `dict` & `variants_as_dicts` is `False`: handled above, returns `VariantVal`
- `value` is `dict` & `variants_as_dicts` is `True`: not handled?

What do we do for the cases we are not handling?
Good question. For now, we should throw an error in the other cases, as I am not aware of any code paths we could use to test them. I specifically set `variants_as_dicts` to false in one particular case which was encountered during `createDataFrame`.

To be more specific:

- `value` is `VariantVal` & `variants_as_dicts` is `False`: this was not handled before (`value` is `VariantVal` was not handled at all) and is still not handled => no regression.
- `value` is `dict` & `variants_as_dicts` is `True`: `variants_as_dicts` being `True` is only possible in one code path, the one where I have set `variants_as_dicts` to true. Earlier, this would have returned Variants and we would see an error later on anyway (in `pa.Table.from_arrays`). I don't think this can cause any regressions.
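The four cases and the proposed error behavior can be encoded as a small dispatch table. This is purely illustrative; the labels and function names here are not the actual PySpark internals:

```python
# Hypothetical encoding of the (input kind, variants_as_dicts) matrix
# discussed above; "error" marks the two combinations that should raise.

ACTIONS = {
    ("dict", False): "return VariantVal",   # pre-existing conversion path
    ("VariantVal", True): "return dict",    # new createDataFrame path
    ("VariantVal", False): "error",         # was never handled -> no regression
    ("dict", True): "error",                # unreachable in today's code paths
}


def action_for(value_kind: str, variants_as_dicts: bool) -> str:
    """Look up what the converter should do for one cell of the matrix."""
    action = ACTIONS[(value_kind, variants_as_dicts)]
    if action == "error":
        raise TypeError(
            f"unhandled: {value_kind} with variants_as_dicts={variants_as_dicts}"
        )
    return action


print(action_for("VariantVal", True))
```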
Co-authored-by: Gene Pang <[email protected]>
Thanks for the fixes!
LGTM
thanks, merging to master!
What changes were proposed in this pull request?

In this PR, we add a case to handle None in `VariantType.toInternal`. Also, variants can now be used with `createDataFrame` when using Spark Connect.

Why are the changes needed?

Previously, `spark.createDataFrame([(VariantVal(bytearray([12, 1]), bytearray([1, 0, 0])),), (None,)], "v variant").show()` failed because there was no way of handling nulls. Also, `createDataFrame` did not work with Variants prior to this PR; now it does.

Does this PR introduce any user-facing change?

Yes, it fixes a bug where `None` values couldn't be handled with Variant schemas, and allows users to use `createDataFrame` with Variants in the Python client.

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No