[SPARK-50815][PYTHON][SQL] Fix bug where passing null Variants in createDataFrame causes it to fail and add Variant support in createDataFrame in Spark Connect #49487
Conversation
@HyukjinKwon @cloud-fan Can you look at this?
@harshmotw-db Thanks for the fix!
Thanks for this fix!
LGTM
@gene-db I have now added support for variants in createDataFrame in Spark Connect as well. Can you review again since it modifies one of the code paths that you worked on?
@harshmotw-db Thanks! I left a few questions.
@@ -333,6 +340,7 @@ def convert(data: Sequence[Any], schema: StructType) -> "pa.Table":
LocalDataToArrowConversion._create_converter(
    field.dataType,
    field.nullable,
    variants_as_dicts=True
How do we know when to set this to true or false? It is not clear to me.
This is mostly a hack: the data produced by these converters is fed almost directly into a PyArrow API that creates a PyArrow table later in the method. That API does not know how to deal with VariantVal, and since PyArrow is a third-party library, we cannot change it.
The Arrow schema is a struct with metadata stating that it is a Variant. So, we produce the data as a dict, which the PyArrow API converts into Arrow structs.
I have set it to true only in this specific part of the codebase so that createDataFrame works. I am thinking of cleaner ways of doing this; if I find one, I could merge it as a follow-up.
Ideally, Arrow should have its own Variant type (which could be defined using Arrow extension types). There has been some discussion about it.
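The workaround described above can be sketched roughly as follows. This is not the actual PySpark converter; `FakeVariantVal` is a hypothetical stand-in for `pyspark.sql.types.VariantVal`, assuming only the `value`/`metadata` attributes that appear in the diff:

```python
# Sketch (not the real PySpark code) of the variants_as_dicts workaround:
# PyArrow cannot ingest VariantVal objects, but it can build an Arrow
# struct column from dicts with "value" and "metadata" byte fields.

class FakeVariantVal:
    """Hypothetical stand-in for pyspark.sql.types.VariantVal."""

    def __init__(self, value: bytes, metadata: bytes) -> None:
        self.value = value
        self.metadata = metadata


def convert_variant(value, variants_as_dicts: bool):
    """Simplified mirror of the converter branch shown in the diff."""
    if isinstance(value, dict) and {"value", "metadata"} <= value.keys():
        # Arrow-struct-shaped dict -> VariantVal (the pre-existing direction).
        return FakeVariantVal(value["value"], value["metadata"])
    elif isinstance(value, FakeVariantVal) and variants_as_dicts:
        # VariantVal -> plain dict, so a later pa.Table.from_arrays call can
        # treat the column as an ordinary Arrow struct.
        return {"value": value.value, "metadata": value.metadata}
    raise TypeError("unsupported variant representation")


v = FakeVariantVal(bytearray([12, 1]), bytearray([1, 0, 0]))
print(convert_variant(v, variants_as_dicts=True))
```

The dict produced here round-trips: feeding it back through the converter with `variants_as_dicts=False` reconstructs the variant object.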
):
    return VariantVal(value["value"], value["metadata"])
elif isinstance(value, VariantVal) and variants_as_dicts:
Isn't there a matrix of inputs we could get?

- `value` is `VariantVal` & `variants_as_dicts` is `False`: not handled?
- `value` is `VariantVal` & `variants_as_dicts` is `True`: handled here, returns `dict`
- `value` is `dict` & `variants_as_dicts` is `False`: handled above, returns `VariantVal`
- `value` is `dict` & `variants_as_dicts` is `True`: not handled?

What do we do for the cases we are not handling?
Good question. For now, we should throw an error in the other cases, as I am not aware of any code paths we could use to test them. I specifically set `variants_as_dicts` to false in one particular case which was encountered during `createDataFrame`.

To be more specific:

- `value` is `VariantVal` & `variants_as_dicts` is `False`: this was not handled before (`value` is `VariantVal` was not handled at all) and is still not handled => no regression.
- `value` is `dict` & `variants_as_dicts` is `True`: `variants_as_dicts` being `True` is only possible in one code path, the one where I have set `variants_as_dicts` to true. Earlier, this would have returned Variants and we would see an error later on anyway (in `pa.Table.from_arrays`). I don't think this can cause any regressions.
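The four cases and the proposed error behavior can be encoded as a small dispatch table. This is purely illustrative; the labels and function names here are not the actual PySpark internals:

```python
# Hypothetical encoding of the (input kind, variants_as_dicts) matrix
# discussed above; "error" marks the two combinations that should raise.

ACTIONS = {
    ("dict", False): "return VariantVal",   # pre-existing conversion path
    ("VariantVal", True): "return dict",    # new createDataFrame path
    ("VariantVal", False): "error",         # was never handled -> no regression
    ("dict", True): "error",                # unreachable in today's code paths
}


def action_for(value_kind: str, variants_as_dicts: bool) -> str:
    """Look up what the converter should do for one cell of the matrix."""
    action = ACTIONS[(value_kind, variants_as_dicts)]
    if action == "error":
        raise TypeError(
            f"unhandled: {value_kind} with variants_as_dicts={variants_as_dicts}"
        )
    return action


print(action_for("VariantVal", True))
```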
Co-authored-by: Gene Pang <[email protected]>
Thanks for the fixes!
LGTM
thanks, merging to master!
What changes were proposed in this pull request?

In this PR, we add a case to handle None in `VariantType.toInternal`. Also, variants can now be used with `createDataFrame` when using Spark Connect.

Why are the changes needed?

Previously, `spark.createDataFrame([(VariantVal(bytearray([12, 1]), bytearray([1, 0, 0])),), (None,)], "v variant").show()` failed because there was no way of handling nulls. Also, `createDataFrame` did not work with Variants prior to this PR; now it does.

Does this PR introduce any user-facing change?

Yes, it fixes a bug where `None` values couldn't be handled with Variant schemas, and allows users to use `createDataFrame` with Variants in the Python client.

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No