Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

patch: Parquet Column Names with "Special Characters" fix #109

Closed
wants to merge 5 commits into from
Closed

patch: Parquet Column Names with "Special Characters" fix #109

wants to merge 5 commits into from

Conversation

MarquisC
Copy link

We're using PyIceberg to read Iceberg tables stored in S3 as parquet. We have column names in the form of id:foo diagnostic:bar using : as a sort of delimiter to help us do some programatic maintenance on our side.

In Parquet the column names are magically subbed in this case : -> _x3A and upon attempts at scanning/reading the data the schema of the table doesn't match the physical column names for PyArrow.

The first pass is a naive fix for this that I have tested and works, but I'm looking for guidance on where you all want me to put this logic, and I'm happy to add it there instead.

@mchamberlain-mdsol
Copy link

mchamberlain-mdsol commented Oct 27, 2023

Exception Example (before hack):

No match for FieldRef.Name(prefix:foo) in prefix_x3Afoo: string not null
... a bunch of other columns 
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

After the hack I can read the dataframe, would love some guidance on where you all think something like this is most appropiate.

@Fokko
Copy link
Contributor

Fokko commented Oct 27, 2023

Thanks for raising this @MarquisC. This looks like #83, can you check if that also resolves your problem? Otherwise, I think it will be a good place to add it here as well. It also shows how to test this.

@MarquisC
Copy link
Author

Will test and close, unit test wise looks like #83 is the fix.

@mchamberlain-mdsol
Copy link

mchamberlain-mdsol commented Oct 28, 2023

Will test and close, unit test wise looks like #83 is the fix.

This PR can be closed, confirmed fix is in master.

@Fokko
Copy link
Contributor

Fokko commented Nov 2, 2023

Thanks you both @MarquisC and @mchamberlain-mdsol for checking this!

@Fokko Fokko closed this Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants