[Variant] Decimal unshredding support #8540
Conversation
@alamb -- I don't know what to make of this parquet decimal widening thing. Do you know [somebody who knows] what might be going on here?
Thanks @scovich -- the unshredding part looks good to me and I left a comment about the parquet changes
Ok(DataType::Decimal256(precision, scale))
// Dispatch based on precision thresholds using DecimalType trait constants
if precision <= Decimal32Type::MAX_PRECISION {
Is this change needed for this PR?
Decimal32/Decimal64 were just added recently and they are not widely supported by all the kernels or downstream systems (e.g. DataFusion).
I think this change means that some parquet files that are currently read as Decimal128 would come back as Decimal32/Decimal64, which is likely to be a pretty big change for some consumers.
I would suggest backing out this part of the change, if possible, and filing a separate ticket to discuss changing how decimal data is read from existing parquet files.
If we want to proceed, we would probably need to add additional tests that show what happens when decimals are read from parquet.
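For illustration, a minimal sketch of the kind of round-trip test this might involve, assuming the standard `parquet::arrow` writer/reader APIs; the asserted Decimal128 type reflects the current reader behavior described above (not a settled decision), and the test name and column are made up:

```rust
use std::sync::Arc;

use arrow::array::Decimal128Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

#[test]
fn decimal_precision_9_roundtrips_as_decimal128() {
    // Precision 9 would fit a 32-bit decimal, but today arrow-parquet hands it
    // back as Decimal128; this test pins down (and documents) that behavior.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "d",
        DataType::Decimal128(9, 2),
        false,
    )]));
    let array = Decimal128Array::from(vec![12345_i128])
        .with_precision_and_scale(9, 2)
        .unwrap();
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)]).unwrap();

    // Write a single-column parquet file into an in-memory buffer.
    let mut buf: Vec<u8> = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Re-open the buffer and check which arrow type the reader infers for the column.
    let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf)).unwrap();
    assert_eq!(
        builder.schema().field(0).data_type(),
        &DataType::Decimal128(9, 2)
    );
}
```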
> Is this change needed for this PR?
Without it, almost all variant decimal integration tests fail because they expect `Variant::Decimal4` or `Variant::Decimal8` and they get `Variant::Decimal16` instead. I don't know any good way to "fix" the test expectations, and anyway the expectations are arguably correct and should not change.
That said...
I suspect the current version is too aggressive -- it uses precision as the ultimate lower bound regardless of the physical encoding; I'm working on a revised version that uses precision as an upper bound, with the physical encoding as lower bound.
Still TBD whether that reduces the test carnage at all, let alone whether it's actually the right approach.
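For concreteness, a rough sketch of what that selection could look like, assuming the recently added Decimal32/Decimal64 types and the `DecimalType` precision constants; `physical_byte_width` is a hypothetical stand-in for whatever the reader knows about the parquet physical encoding, not an actual parameter in the code:

```rust
use arrow::datatypes::{DataType, Decimal32Type, Decimal64Type, DecimalType};

/// Pick the narrowest arrow decimal type that satisfies both the declared
/// precision (upper bound) and the parquet physical encoding (lower bound).
fn narrowest_decimal(precision: u8, scale: i8, physical_byte_width: usize) -> DataType {
    if precision <= Decimal32Type::MAX_PRECISION && physical_byte_width <= 4 {
        DataType::Decimal32(precision, scale)
    } else if precision <= Decimal64Type::MAX_PRECISION && physical_byte_width <= 8 {
        DataType::Decimal64(precision, scale)
    } else {
        DataType::Decimal128(precision, scale)
    }
}
```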
> I don't know any good way to "fix" the test expectations and anyway the expectations are arguably correct and should not change.
I recommend special-casing the test harness to cast the Decimal types, with a comment pointing to a ticket that tracks adding the support for real.
I don't think we need to solve Parquet --> Decimal32/64 support in this PR.
What if we added the corrective cast to the `VariantArray` constructor, which already fixes up the value and metadata columns? The variant spec arguably requires using the narrowest possible decimal type for a given precision:

| Decimal Precision | Decimal value type | Variant Physical Type |
|---|---|---|
| 1 <= precision <= 9 | int32 | decimal4 |
| 10 <= precision <= 18 | int64 | decimal8 |
| 19 <= precision <= 38 | int128 | decimal16 |
| > 38 | Not supported | |
The casting fix worked! Even better, it's < 20 LoC and isolated completely to variant code.
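A minimal sketch of what such a corrective cast could look like (not the PR's exact code); it assumes arrow's `cast` kernel accepts Decimal32/Decimal64 targets, and `narrow_decimal` is a hypothetical helper name:

```rust
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::{DataType, Decimal32Type, Decimal64Type, DecimalType};
use arrow::error::ArrowError;

/// Narrow a shredded decimal `typed_value` column that arrow-parquet widened to
/// Decimal128 back to the narrowest type the variant spec allows for its precision.
fn narrow_decimal(typed_value: ArrayRef) -> Result<ArrayRef, ArrowError> {
    let target = match typed_value.data_type() {
        DataType::Decimal128(p, s) if *p <= Decimal32Type::MAX_PRECISION => {
            DataType::Decimal32(*p, *s)
        }
        DataType::Decimal128(p, s) if *p <= Decimal64Type::MAX_PRECISION => {
            DataType::Decimal64(*p, *s)
        }
        // Already the narrowest representation (or not a decimal at all): leave it alone.
        _ => return Ok(typed_value),
    };
    cast(typed_value.as_ref(), &target)
}
```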
Downside is... the cast.
So we should still try to figure out a way to make the parquet reader pull the correct type:
- either globally (some variation of the fix I first attempted here)
- or perhaps something more targeted in the parquet reader, which notices the variant type extension and massages the footer schema accordingly
Filed as #8549
Love it!
Decimal64(p, s) if is_valid_variant_decimal(p, s, 18) => borrow!(),
Decimal128(p, s) if is_valid_variant_decimal(p, s, 38) => borrow!(),
//
// NOTE: arrow-parquet reads widen 32- and 64-bit decimals to 128-bit, but the variant spec
👍
@alamb -- conflicts resolved, should be ready for merge now
Thank you @scovich
Which issue does this PR close?
- `Decimal128`: #8332

Rationale for this change
Missing feature
What changes are included in this PR?
Add decimal unshredding support, which should have been straightforward except:
- A `DecimalType` trait for the `VariantDecimalXX` classes to implement.
- The variant integration tests got `Variant::Decimal16` values when they expected `Variant::Decimal4` or `Variant::Decimal8` (the actual values are correct). Rather than directly tackle the bug in arrow-parquet itself (which has a large blast radius), I updated the `VariantArray` constructor to cast such columns back to the correct type as needed.

Are these changes tested?
Yes. The variant decimal integration tests now pass where they used to fail.
Are there any user-facing changes?
No.