patch: Parquet Column Names with "Special Characters" fix #109

MarquisC · 2023-10-27T15:14:14Z

We're using PyIceberg to read Iceberg tables stored in S3 as Parquet. Our column names take the form id:foo and diagnostic:bar, using : as a delimiter to help with programmatic maintenance on our side.

In the Parquet files the column names are automatically substituted in this case (: -> _x3A), so when we try to scan or read the data, the table schema doesn't match the physical column names that PyArrow sees.

The first pass is a naive fix for this that I have tested and works, but I'm looking for guidance on where you all want me to put this logic, and I'm happy to add it there instead.
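
For illustration, the mismatch described above can be sketched roughly as follows. This is not the patch from this PR and not PyIceberg's implementation: `sanitize_name` and `physical_to_logical` are hypothetical helpers, and the escaping rule (every character other than an ASCII letter, digit, or underscore becomes `_x` plus its uppercase hex code point) is an assumption inferred from the : -> _x3A substitution.

```python
import re

# Assumption: anything other than ASCII letters, digits, or "_" gets escaped.
_UNSAFE = re.compile(r"[^A-Za-z0-9_]")


def sanitize_name(name: str) -> str:
    """Escape each unsafe character as "_x" + its uppercase hex code point."""
    return _UNSAFE.sub(lambda m: f"_x{ord(m.group()):X}", name)


def physical_to_logical(logical_names: list[str]) -> dict[str, str]:
    """Map the escaped (physical) column name back to the table-schema name."""
    return {sanitize_name(name): name for name in logical_names}


# Hypothetical table-schema column names, taken from the examples in this thread.
mapping = physical_to_logical(["id:foo", "diagnostic:bar", "plain"])
print(mapping)
```

A reader could use a mapping like this to look up columns by their physical names in the file and then rename them to the logical names from the table schema.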

mchamberlain-mdsol · 2023-10-27T15:21:13Z

Exception Example (before hack):

No match for FieldRef.Name(prefix:foo) in prefix_x3Afoo: string not null
... a bunch of other columns 
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

After the hack I can read the dataframe. I would love some guidance on where you all think something like this is most appropriate.

Fokko · 2023-10-27T19:31:44Z

Thanks for raising this, @MarquisC. This looks like #83; can you check whether that also resolves your problem? If not, I think that would be a good place to add this as well. It also shows how to test this.

MarquisC · 2023-10-28T08:01:46Z

Will test and close. Unit-test-wise, it looks like #83 is the fix.

mchamberlain-mdsol · 2023-10-28T22:35:29Z

> Will test and close. Unit-test-wise, it looks like #83 is the fix.

This PR can be closed; I confirmed the fix is in master.

Fokko · 2023-11-02T12:20:43Z

Thank you both, @MarquisC and @mchamberlain-mdsol, for checking this!

MarquisC added 2 commits October 27, 2023 10:55

trying out naive hack

523ff61

removing whole file reformatting

b8c2ae3

MarquisC added 3 commits October 27, 2023 21:27

reset pyarrow.py and add specific test

7750915

adding static condition and defaulting logic

67e5a03

complete re-vert, please test

84fa075

Fokko closed this Nov 2, 2023
