
Read: fetch file_schema directly from pyarrow_to_schema #597

Merged (8 commits) on Apr 13, 2024

Conversation

@HonahX (Contributor) commented Apr 11, 2024

#584 (comment)

> If we truly read by Field-IDs the names should be irrelevant, so we should probably update our mapping to ensure we correctly project by IDs.

I think we do correctly project by IDs. The real problem is the way that we sanitize the column names.
In #83, we added sanitization of the file_schema in _task_to_table under the assumption that the parquet file's column names follow the Avro naming spec. However, I think the "sanitization" should be more general here: it should just ensure that the final file_project_schema contains the same column names as the parquet file's schema.
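To make the sanitization concrete, here is a toy sanitizer in the spirit of the Avro naming rules (`[A-Za-z_][A-Za-z0-9_]*`). It is illustrative only: the exact escape scheme used by PyIceberg and Iceberg Java may differ, and the `sanitize` helper below is hypothetical.

```python
def _valid(ch: str, first: bool) -> bool:
    # Avro names must match [A-Za-z_][A-Za-z0-9_]*
    if not ch.isascii():
        return False
    if first:
        return ch.isalpha() or ch == "_"
    return ch.isalnum() or ch == "_"


def sanitize(name: str) -> str:
    # Hypothetical escape: replace each invalid character with "_x<HEX>".
    return "".join(
        ch if _valid(ch, i == 0) else f"_x{ord(ch):X}"
        for i, ch in enumerate(name)
    )


print(sanitize("lat.long"))  # "lat_x2Elong": "." is not a valid Avro name character
```

The mismatch arises because the Parquet columns carry names sanitized like this, while the schema JSON embedded in the file metadata keeps the original names.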

The names in file_schema differ from the actual column names in the parquet file because we first try to load the file schema from the JSON string stored in the parquet file metadata: link

```python
file_schema = (
    Schema.model_validate_json(schema_raw)
    if schema_raw is not None
    else pyarrow_to_schema(physical_schema, name_mapping)
)
```

Parquet files written by Iceberg Java contain this metadata JSON string. The JSON string represents the Iceberg table schema at the time of writing the file, so it contains un-sanitized column names.

Since we always need to run a visitor to sanitize/ensure column names match, how about we just get the file_schema directly from the pyarrow physical schema?

```python
file_schema = pyarrow_to_schema(physical_schema, name_mapping)
```

This way, we can ensure that the column names match, and thus do not need to sanitize the column names later.

I have verified that this change fixes both the sanitization issue in #83 and the issue here. Given that we want to align the writing behavior with the Java implementation, we should also proceed with #590.

Borrowed the integration test from #590

@kevinjqliu (Contributor) left a comment:

LGTM!

I want to summarize my understanding, based on the comment from #584.

When reading the parquet files, we use the projected version of the parquet file's schema; the Arrow table that is created is then cast to the Iceberg schema. This mapping is based on field IDs.

```python
return to_requested_schema(projected_schema, file_project_schema, arrow_table)
```

```
@@ -966,20 +965,15 @@ def _task_to_table(
    with fs.open_input_file(path) as fin:
        fragment = arrow_format.make_fragment(fin)
        physical_schema = fragment.physical_schema
        schema_raw = None
        if metadata := physical_schema.metadata:
            schema_raw = metadata.get(ICEBERG_SCHEMA)
```

My initial intent was that it was probably faster to deserialize the schema rather than run the visitor, but this shows it is not worth the additional complexity :)

@HonahX HonahX marked this pull request as ready for review April 12, 2024 23:22
@HonahX HonahX added this to the PyIceberg 0.6.1 milestone Apr 13, 2024
@HonahX (Contributor, Author) commented Apr 13, 2024

Thanks @kevinjqliu and @Fokko for reviewing! Thanks @kevinjqliu for the integration test!

Successfully merging this pull request may close these issues.

[BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name