Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50815][PYTHON][SQL] Fix bug where passing null Variants in createDataFrame causes it to fail and add Variant support in createDataFrame in Spark Connect #49487

Conversation

harshmotw-db
Copy link
Contributor

@harshmotw-db harshmotw-db commented Jan 14, 2025

What changes were proposed in this pull request?

In this PR, we add a case to handle None in VariantType.toInternal. Also, variants can be used with createDataFrame when using Spark Connect.

Why are the changes needed?

Previously, spark.createDataFrame([(VariantVal(bytearray([12, 1]), bytearray([1, 0, 0])),), (None,)], "v variant").show() failed because there was no way of handling nulls.

Also, createDataFrame did not work with Variants prior to this PR - now it does.

Does this PR introduce any user-facing change?

Yes, it fixes a bug where None values couldn't be handled with Variant schemas, and allows users to use createDataFrame with Variants in the Python client.

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

@harshmotw-db
Copy link
Contributor Author

@HyukjinKwon @cloud-fan Can you look at this?

Copy link
Contributor

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harshmotw-db Thanks for the fix!

@harshmotw-db harshmotw-db requested a review from gene-db January 14, 2025 22:16
Copy link
Contributor

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix!

LGTM

@harshmotw-db
Copy link
Contributor Author

@gene-db I have now added support for variants in createDataFrame in Spark Connect as well. Can you review again since it modifies one of the code paths that you worked on?

@harshmotw-db harshmotw-db changed the title [SPARK-50815] Fix bug where passing null Variants in createDataFrame causes it to fail [SPARK-50815] Fix bug where passing null Variants in createDataFrame causes it to fail and add Variant support in createDataFrame in Spark Connect Jan 15, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-50815] Fix bug where passing null Variants in createDataFrame causes it to fail and add Variant support in createDataFrame in Spark Connect [SPARK-50815][PYTHON][SQL] Fix bug where passing null Variants in createDataFrame causes it to fail and add Variant support in createDataFrame in Spark Connect Jan 15, 2025
@harshmotw-db harshmotw-db requested a review from gene-db January 15, 2025 00:08
Copy link
Contributor

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harshmotw-db Thanks! I left a few questions.

@@ -333,6 +340,7 @@ def convert(data: Sequence[Any], schema: StructType) -> "pa.Table":
LocalDataToArrowConversion._create_converter(
field.dataType,
field.nullable,
variants_as_dicts = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know when to set this to true or false? It is not clear to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly a hack because the data produced by these converters are almost directly fed to a PyArrow API to create a PyArrow table later in the method. Now, this API doesn't know how to deal with VariantVal and since it's a third party library, we cannot do anything about it.

The Arrow schema is a struct with metadata stating that it is a Variant. So, we try to get the data as dict which would be converted into Arrow structs by the PyArrow API.

I have set it to true only in this specific part of the codebase so I can get createDataFrame to work. I am thinking of cleaner ways of doing this but if I find something I could merge that as a follow-up.

Ideally Arrow should have its own Variant type (which can be defined using Arrow extension types). There was some discussion about it.

):
return VariantVal(value["value"], value["metadata"])
elif isinstance(value, VariantVal) and variants_as_dicts:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a matrix of inputs we could get?

  • value is VariantVal & variants_as_dicts is False: not handled?
  • value is VariantVal & variants_as_dicts is True: handled here, returns dict
  • value is dict & variants_as_dicts is False: handled above, returns VariantVal
  • value is dict & variants_as_dicts is True: not handled?

What do we do for the cases we are not handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Question. For now, we should throw an error in the other cases as I am not aware of any code paths we can use to test them. I specifically set variants_as_dicts to false in one particular case which was encountered during createDataFrame.

To be more specific:

  • value is VariantVal & variants_as_dicts is False: This was not handled before (value is VariantVal was not handled at all) and is still not handled => no regression.
  • value is dict & variants_as_dicts is True: variants_as_dicts is True is only possible in one codepath - the one where I have set variants_as_dicts to true. Earlier, this would have returned Variants and we would see an error later on anyway (in pa.Table.from_arrays). I don't think this can cause any regressions.

Copy link
Contributor

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes!

LGTM

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 39bb2d8 Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants