
[EPIC] [Parquet] Implement Variant type support in Parquet #6736


Open · 11 tasks
alamb opened this issue Nov 15, 2024 · 28 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Nov 15, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Parquet recently adopted the Variant type from Spark: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

Describe the solution you'd like
I would like to implement variant support in parquet-rs

Additional context
I am not sure if any other parquet implementations have implemented this yet / if there are example parquet files. I will attempt to find out

Related Tasks

Related PRs

**Related Community Resources**

@alamb alamb added the enhancement Any new improvement worthy of an entry in the changelog label Nov 15, 2024
@alamb alamb added the parquet Changes to the parquet crate label Nov 15, 2024
@CurtHagenlocher
Contributor

There's an implementation in Spark (try here for starters) but when I last looked ~two months ago there was no binary test data; only some round trips via JSON.

@tustvold
Contributor

tustvold commented Dec 4, 2024

I do wonder if a precursor to supporting this would be some way to translate / represent the variant data in Arrow. Whilst there are non-Arrow APIs, they'd likely struggle to accommodate this addition, and they aren't how the vast majority of people consume Parquet data using this crate.

@findepi
Member

findepi commented Dec 4, 2024

From arrow perspective, would that be a new DataType, or rather a convention of using DataType::Struct with two Binary fields?

A fully performant variant implementation should be able to leverage file-level column disaggregation (shredding), but I do think this could come as a follow-up to a "normal" Variant type implementation.

@tustvold
Contributor

tustvold commented Dec 4, 2024

From arrow perspective, would that be a new DataType, or rather a convention of using DataType::Struct with two Binary fields?

I don't know; I've not really been following the variant proposal closely enough to weigh in here. However, my understanding is that shredding is one of the major motivators for adding this to Parquet, as without it you might as well just embed any record format, e.g. Avro. I therefore suspect most use-cases will be at least partially shredded, and the reader will need to handle this case. This is especially true given that variant_value is NULL when the data is shredded, as opposed to, say, duplicating the content (which would have its own issues, TBC), so we can't just ignore the shredded data.

Unfortunately I can't see an obvious way to be able to represent this sort of semi-structured data within the arrow format without introducing a new DataType that is able to accommodate arrays having the same type, but different child layouts...

TLDR I suspect actioning this will require arrow defining a way to represent semi-structured data...

@findepi
Member

findepi commented Dec 4, 2024

There needs to be a way to represent a series of variant values having "no type in common" (variant integer, variant boolean, variant varchar, etc. all mixed up). For that, some blob-like representation with internal structure seems natural.
Then there should be a way to carry the shredded columns along without having to put them back into that blob, so yes: one type, different child layouts.
It feels to me that the runtime representation will end up being similar to what is defined in Parquet (https://github.com/apache/parquet-format/blob/master/VariantShredding.md)... so maybe it should be the same representation, to provide for an efficient read path.

@findepi
Member

findepi commented Dec 4, 2024

When considering what to do in Arrow, we should also keep an eye on the ongoing effort in Iceberg apache/iceberg#10392 (comment)
This could inform some design decisions.
cc @Xuanwo

@alamb
Contributor Author

alamb commented Jan 24, 2025

There is now a related proposal being added to parquet (shredding the variant type):

FWIW I would love to see a Rust representation (i.e. for someone to implement the variant type in the Rust parquet decoder). If you are interested, there are relevant conversations underway on the parquet mailing list

@alamb
Contributor Author

alamb commented Jan 24, 2025

To be clear, we may not be able to merge / really use variants in Rust until it gets into Arrow, but we can work out how it would work in Parquet (maybe with the non-Arrow interfaces) first

@CurtHagenlocher
Contributor

There's a pure Python implementation in the Spark repo. It's almost standalone, having a dependency only on PySparkValueError.

@alamb
Contributor Author

alamb commented Mar 6, 2025

I spent some time listening and thinking about this on the parquet call yesterday: https://lists.apache.org/thread/cnn6264g56jktrwmplz89x8cgkcvr4ql

Note there is a thread on the arrow mailing list about adding variant support in arrow-rs:

(and it looks like @wjones127 made some sort of demo using an extension type in datafusion):

Unfortunately I can't see an obvious way to be able to represent this sort of semi-structured data within the arrow format

What I suggest is that the parquet reader reads variant columns as Binary / LargeBinary with an arrow extension type annotation, which would let downstream projects interpret / read the extension type correctly
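As a rough sketch of that annotation (assuming Arrow's standard `ARROW:extension:name` metadata key; the `arrow.parquet.variant` extension name here is purely illustrative, not a finalized name):

```rust
use std::collections::HashMap;

// Sketch: metadata a reader could attach to a Binary / LargeBinary field so
// that downstream projects can recognize the column as a Variant. The key is
// Arrow's standard extension-type metadata key; the extension name itself
// ("arrow.parquet.variant") is an assumption for illustration only.
fn variant_extension_metadata() -> HashMap<String, String> {
    HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.parquet.variant".to_string(),
    )])
}
```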

I think one challenge will be "how to tell the parquet writer to write / annotate the columns as variant"

Before we can do anything useful with the variant type, we'll need a library to parse / interpret a variant value (aka the equivalent of a JSON parser / set of objects)
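To make that concrete, such a parser could start as a zero-copy view over the encoded bytes. The minimal sketch below decodes only a handful of cases from the VariantEncoding.md layout (the low two bits of the header byte select the basic type; the high six bits carry a primitive type id or a short-string length); the type and function names are illustrative, and all other cases are elided:

```rust
// A minimal sketch of decoding the first value in a Variant `value` buffer,
// following the VariantEncoding.md layout: the low 2 bits of the header byte
// select the basic type (0 = primitive, 1 = short string), and the high 6
// bits carry the primitive type id or the short-string length.
#[derive(Debug, PartialEq)]
enum VariantValue<'a> {
    Null,
    Bool(bool),
    Int8(i8),
    ShortString(&'a str),
}

fn decode_value(buf: &[u8]) -> Option<VariantValue<'_>> {
    let header = *buf.first()?;
    match header & 0b11 {
        0 => match header >> 2 {
            // Primitive type ids per the spec: 0 = null, 1 = true, 2 = false, 3 = int8
            0 => Some(VariantValue::Null),
            1 => Some(VariantValue::Bool(true)),
            2 => Some(VariantValue::Bool(false)),
            3 => Some(VariantValue::Int8(*buf.get(1)? as i8)),
            _ => None, // other primitive types elided in this sketch
        },
        1 => {
            // Short string: the high 6 bits are the byte length
            let len = (header >> 2) as usize;
            let bytes = buf.get(1..1 + len)?;
            std::str::from_utf8(bytes).ok().map(VariantValue::ShortString)
        }
        _ => None, // objects (2) and arrays (3) elided in this sketch
    }
}
```

Because the decoded string borrows from the input buffer, a full decoder in this style could stay zero-copy for strings and binary values.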

@alamb
Contributor Author

alamb commented Mar 6, 2025

So the first step as I see it is that someone has to code up / find a Rust implementation for working with variant values. This could be a port of, or inspired by, the ones @CurtHagenlocher points at here:

There's a pure Python implementation in the Spark repo.

There's an implementation in Spark (try here for starters)

Once we have such a library, we can then figure out if/how it should be used in the parquet reader/writer directly

@wjones127
Member

I haven't had time to work on this recently, but for a rust implementation for working with variant values, anyone should feel free to work off what I had started in the https://github.com/datafusion-contrib/datafusion-functions-variant/tree/main/open-variant repo. There is a core open-variant crate there that's just meant to be reading and writing variant values.

@alamb
Contributor Author

alamb commented Mar 6, 2025

Since variant is part of the Parquet spec now, I think the code to interpret it could easily belong in the arrow-rs repository, in the parquet module. I think it will be an interesting Rust API design challenge to make a really efficient / zero-copy decoder

@alamb
Contributor Author

alamb commented Mar 8, 2025

Here is a PR to implement variant in C/C++ from @neilechao

@alamb
Contributor Author

alamb commented Apr 1, 2025

I have requested example VARIANTs on the mailing list as well:

@adriangb
Contributor

adriangb commented Apr 6, 2025

Digging deep into some Spark code I found some pretty enlightening information about how this will actually be encoded into Parquet: apache/spark@3c3d1a6#diff-ca9eeead72220965c7bbd52631f7125d4c1ef22b898e5baec83abc7be9495325

So it seems that apache/datafusion#2581 / apache/datafusion#11745 will ultimately be a blocker for proper support.

I think the things we'll need here are:

  • Ability to project individual struct fields, in particular column -> typed_value -> field_name, for selection and during predicate pushdown pruning
  • Functions that operate on the entire structure and know how to parse the binary metadata/value fields
  • A type that you can declare at the schema level that doesn't force you to exhaustively define the unknown typed fields of the struct
  • Statistics support for nested struct fields

On the DataFusion side I think all we need is something like apache/datafusion#15057 to allow rewriting a filter or projection such as variant_get(col, 'key') = 5 into "col.typed_value.key.typed_value" = 5 on a per-file level, if we see from the file schema that the key is shredded. Then, if all of the above is in place, we get stats filtering, selective reading of the column for filtering / projection, etc.
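That per-file rewrite could be sketched roughly as follows. This is a hypothetical helper, not DataFusion's actual API; it only illustrates the schema-dependent choice between the shredded column path and the generic function call:

```rust
// Hypothetical sketch (names are illustrative): when the file schema shows
// that `key` is shredded under `column`, rewrite the `variant_get` call to
// address the typed shredded sub-column directly; otherwise keep the generic
// function over the binary variant value.
fn rewrite_variant_get(column: &str, key: &str, shredded_keys: &[&str]) -> String {
    if shredded_keys.contains(&key) {
        // Shredded in this file: reference the typed_value sub-column path
        format!("\"{column}.typed_value.{key}.typed_value\"")
    } else {
        // Not shredded: fall back to parsing the binary metadata/value pair
        format!("variant_get({column}, '{key}')")
    }
}
```

Because shredding can differ file by file, the rewrite has to run with the concrete file schema in hand, which is why a per-file hook like the one the linked issue proposes is needed.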

@alamb
Contributor Author

alamb commented Apr 6, 2025

I have some news I would like to share here -- it seems that @PinkCrow007 has actually been working on a variant implementation in parquet (including support in arrow-rs as an extension type)

Here is an update from Martin Prammer (not sure if he has a github handle)

We've made progress towards implementing the Variant type in both Parquet_rs and Arrow_rs and have prepared a document, shared as a Google doc, that details the overall project and our current status. In summary, our current prototypes are focused on round-tripping binary data between Parquet and Arrow. The Arrow-side Variant is implemented as a CanonicalExtensionType, while the Parquet-side Variant is a LogicalType. If you're interested in looking at the code early, Jiaying's fork is publicly available. Our next goal is to implement binary data decoding/encoding to facilitate using Variants as a stand-alone type, which will then allow us to implement Variant shredding. While there's still work to do before we have the basic functionality for a Variant type, we plan to PR the baseline variant and then address shredding.

At this point, it would be helpful for our team to connect to the broader Apache ecosystem's discussion on Variants; Jiaying has already joined the Arrow discord, and we're both happy to join any relevant mailing lists. We're also soliciting existing Variant implementations that we can use to verify our library against.

It seems they also need some example variant data to make faster progress. I will go beg some more from the parquet mailing list

It is very exciting to see the momentum picking up

@alamb
Contributor Author

alamb commented Apr 7, 2025

I got a response from @cashmand ❤ and I filed an issue in the parquet-testing repo to track the work to add examples

@adriangb
Contributor

adriangb commented Apr 7, 2025

Here is an update from Martin Prammer (not sure if he has a github handle)

Seems like @mprammer does 😄

@alamb
Contributor Author

alamb commented Apr 8, 2025

I just had a discussion with @PinkCrow007, and as I understand it the next steps will be:

  1. Create a draft / work in progress PR that we can start reviewing / providing feedback on based on the fork: main...PinkCrow007:arrow-rs:variant-clean.

The eventual goal will be to break it up and start merging it in as pieces:

  1. Support for reading binary (&[u8]) as variants (accessing via fields, etc) -- what appears to be in arrow-variant/src/encoder in the fork
  2. Support for reading/writing to parquet
  3. Support for shredding, etc

It is going to be so great
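For step 1, one possible shape (purely hypothetical, not the fork's actual API) is a zero-copy view over the two buffers, with accessors driven by the header byte:

```rust
// Hypothetical sketch of a zero-copy variant view: it borrows the metadata
// and value buffers rather than copying them. Only a single header-byte
// check is shown; real field access would walk the object layout that the
// VariantEncoding.md spec describes.
struct Variant<'a> {
    metadata: &'a [u8],
    value: &'a [u8],
}

impl<'a> Variant<'a> {
    fn new(metadata: &'a [u8], value: &'a [u8]) -> Self {
        Variant { metadata, value }
    }

    /// Size in bytes of the metadata dictionary buffer
    fn metadata_len(&self) -> usize {
        self.metadata.len()
    }

    /// True when the value encodes an object (basic type 2 per the spec:
    /// the low 2 bits of the first value byte select the basic type)
    fn is_object(&self) -> bool {
        self.value.first().map_or(false, |b| b & 0b11 == 2)
    }
}
```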

@alamb
Contributor Author

alamb commented Apr 8, 2025

I took a quick skim through the code in main...PinkCrow007:arrow-rs:variant-clean and I found it easy to understand and well structured. I am very much looking forward to the PR

@alamb alamb changed the title [Parquet] Implement Variant type support in Parquet [EPIC] [Parquet] Implement Variant type support in Parquet Apr 11, 2025
@alamb
Contributor Author

alamb commented Apr 13, 2025

Update in case anyone didn't see it: @PinkCrow007 has created a draft PR for comment:

Related: C/C++ implementation in arrow:

@alamb
Contributor Author

alamb commented Apr 13, 2025

I have been studying the variant spec and various implementations. It seems variant support to/from JSON is quite well covered. There are things in the spec (like Time, for example) that are not well represented/implemented in open source Spark. I'll definitely focus on the JSON stuff first

@alamb
Contributor Author

alamb commented Apr 18, 2025

I spent a while today writing up some first suggested steps and linked them to the ticket

It would be great if someone wanted to take a crack at them

I think that will unblock a lot of the rest of what is going on

I am out the rest of the week but hopefully I can check email occasionally

@alamb
Contributor Author

alamb commented Apr 28, 2025

@findepi -- I heard today you may be working on variant support as well. I wonder if you have any thoughts about the above plan (or perhaps already have it implemented and would be willing to share 😆 )

@alamb
Contributor Author

alamb commented May 8, 2025

Status report (also reported to the parquet mailing list): https://lists.apache.org/thread/dy22njos6c0wbo82s377wvbobbd7y6lx

I am pretty stoked to report progress on a Rust Variant implementation (see the epic here).

  1. We have added example binary data in parquet-testing (thanks Micah for the review and merge)
  2. Jiaying's prototype is looking good and we are preparing to start merging it in pieces
  3. I have created a PR to add a parquet-variant crate to arrow-rs (looking for a review 🎣)

Once we have the parquet-variant crate merged, I expect a series of PRs that incrementally add support for different parts of the Variant specification. Once that is done, we will move on to integrating it into the parquet decoder.

Exciting times,
Andrew

6 participants