Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Parquet column writer Dictionary(_, Decimal128) and Dictionary(_, Decimal256) #6987

Merged
merged 1 commit into from
Jan 20, 2025

Conversation

korowa
Copy link
Contributor

@korowa korowa commented Jan 16, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

Currently Parquet reads/writes for Arrow Dictionary(Decimal) type where Decimal is INT-based is falling back to arrow_cast::cast, which leads to losing precision.

What changes are included in this PR?

For Dictionary(Decimal) type where decimal is int-based, writer and reader now perform arrow_cast::cast only for dictionary packing/unpacking, while values are only casted to required integer type without being modified, similar to how regular Decimal works.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 16, 2025
@korowa korowa force-pushed the fix-write-dict-decimal branch from b2ad521 to 8f0c20c Compare January 16, 2025 05:26
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but would appreciate a second opinion. Nice catch, thanks!

cc @alamb

@@ -270,6 +270,59 @@ where

Arc::new(array) as ArrayRef
}
ArrowType::Dictionary(_, value_type) => match value_type.as_ref() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me. It would be nice if we could get some code reuse here.

@@ -2670,6 +2712,52 @@ mod tests {
one_column_roundtrip_with_schema(Arc::new(d), schema);
}

#[test]
fn arrow_writer_decimal128_dictionary() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified that tests fail if the read changes are reverted.

}
.with_precision_and_scale(*p, *s)?;

arrow_cast::cast(&array, target_type)?
Copy link
Contributor

@etseidl etseidl Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a second look, I'm curious why the cast here rather than creating the ArrayRef directly as is done above? Will this have any performance implications?

Edit: ah, ok. We need to cast to dictionary. Disregard 😅

@alamb alamb changed the title fix: column writer for dictionary decimal primitive type fix: Parquet column writer Dictionary(_, Decimal128) and Dictionary(_, Decimal256) Jan 17, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also took a good look and this PR makes sense to me. Thank you @korowa and @etseidl (I really appreciate the review help)

I'll plan to merge this in the next day or two unless anyone else would like time to comment

@alamb alamb merged commit 0503df4 into apache:main Jan 20, 2025
17 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 20, 2025

Thanks again @korowa and @etseidl

@alamb alamb added the bug label Jan 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants