Skip to content

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Oct 2, 2025

Which issue does this PR close?

Rationale for this change

Missing feature

What changes are included in this PR?

Add decimal unshredding support, which should have been straightforward except:

  1. The variant decimal types are not generic and do not implement any common trait that lets us generalize the logic easily. I added a custom trait in the unshredding module as a workaround, but we should probably look at something similar to arrow's DecimalType trait for VariantDecimalXX classes to implement.
  2. The parquet reader seems to have a bug (feature?) that forces 32- and 64-bit decimal columns to Decimal128 unless the reader specifically requests a narrower type. Which causes the variant decimal integration tests to fail because they receive Variant::Decimal16 values when they expected Variant::Decimal4 or Variant::Decimal8 (the actual values are correct). Rather than directly tackle the bug in arrow-parquet itself (which has a large blast radius), I updated VariantArray constructor to cast such columns back to the correct type as needed.

Are these changes tested?

Yes. The variant decimal integration tests now pass where they used to fail.

Are there any user-facing changes?

No.

@github-actions github-actions bot added parquet Changes to the parquet crate parquet-variant parquet-variant* crates labels Oct 2, 2025
@scovich
Copy link
Contributor Author

scovich commented Oct 2, 2025

@alamb -- I don't know what to make of this parquet decimal widening thing? Do you know [somebody who knows] what might be going on here?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @scovich -- the unshredding part looks good to me and I left a comment about the parquet changes


Ok(DataType::Decimal256(precision, scale))
// Dispatch based on precision thresholds using DecimalType trait constants
if precision <= Decimal32Type::MAX_PRECISION {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change needed for this PR?

Decimal32/Decimal64 were just added recently and it is not widely supported by all the kernels, or downstream systems (e.g. DataFusion)

I think this change means that some parquet files that currently are read as Decimal128 would come back as Decimal32/Decimal64 which is likely to be a pretty big change to some consumers.

I would suggest backing out this part of the change, if possble, and filing a separate ticket to discuss changing how decimal data is read from existing parquet files

If we want to proceed, we should probably need to add additional tests that show what happens when decimals are read from parquet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change needed for this PR?

Without it, almost all variant decimal integration tests fail because they expect Variant::Decimal4 or Variant::Decimal8 and they get Variant::Decimal16 instead. I don't know any good way to "fix" the test expectations and anyway the expectations are arguably correct and should not change.

That said...

I suspect the current version is too aggressive -- it uses precision as the ultimate lower bound regardless of the physical encoding; I'm working on a revised version that uses precision as an upper bound, with the physical encoding as lower bound.

Still TBD whether that reduces the test carnage at all, let alone whether it's actually the right approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know any good way to "fix" the test expectations and anyway the expectations are arguably correct and should not change.

I recommend special casing the test harness to special case cast the Decimal types with a comment to a ticket
that tracks adding the support for real

I don't think we need to solve this Parquet --> Decimal32/64 support in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we added the corrective cast to the VariantArray constructor, which already fixes up the value and metadata columns? The variant spec arguably requires using the narrowest possible decimal type for a given precision:

Decimal Precision Decimal value type Variant Physical Type
1 <= precision <= 9 int32 decimal4
10 <= precision <= 18 int64 decimal8
19 <= precision <= 38 int128 decimal16
> 38 Not supported  

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The casting fix worked! Even better, it's < 20 LoC and isolated completely to variant code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downside is... the cast.

So we should still try to figure out a way to make the parquet reader pull the correct type:

  • either globally (some variation of the fix I first attempted here)
  • or perhaps something more targeted in the parquet reader, which notices the variant type extension and massages the footer schema accordingly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed as #8549

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it!

Decimal64(p, s) if is_valid_variant_decimal(p, s, 18) => borrow!(),
Decimal128(p, s) if is_valid_variant_decimal(p, s, 38) => borrow!(),
//
// NOTE: arrow-parquet reads widens 32- and 64-bit decimals to 128-bit, but the variant spec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@scovich scovich marked this pull request as ready for review October 2, 2025 20:09
@scovich
Copy link
Contributor Author

scovich commented Oct 3, 2025

@alamb -- conflicts resolved, should be ready for merge now

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich

@alamb alamb merged commit 8e5d826 into apache:main Oct 3, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] [Shredding] Support typed_access for Decimal128

2 participants