Refactor Arrow schema conversion #117

Merged
merged 3 commits into from
Nov 3, 2023

@Fokko Fokko commented Nov 2, 2023

We wrapped a schema in a schema.
@Fokko Fokko marked this pull request as ready for review November 2, 2023 12:16
@Fokko Fokko changed the title Refactor schema conversion Refactor Arrow schema conversion Nov 2, 2023
@bitsondatadev bitsondatadev left a comment

Nits

@@ -708,15 +709,17 @@ def _write_table_to_file(filepath: str, schema: pa.Schema, table: pa.Table) -> s

 @pytest.fixture
 def file_int(schema_int: Schema, tmpdir: str) -> str:
-    pyarrow_schema = pa.schema(schema_to_pyarrow(schema_int), metadata={"iceberg.schema": schema_int.model_dump_json()})
+    pyarrow_schema = schema_to_pyarrow(schema_int, metadata={ICEBERG_SCHEMA: bytes(schema_int.model_dump_json(), 'utf-8')})
Contributor

Should this string be a constant defined in a lib somewhere? Or we could at least create an encodings class that centralizes all the schema handling (e.g. defines a constant for 'utf-8', hides ICEBERG_SCHEMA, and exposes cleaner methods that hide the bytes conversion).

WDYT?

Contributor Author

Sounds good, I've introduced a utf8 constant 👍
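
For illustration, the helper suggested in the comment above could be sketched roughly as follows. This is not the code that was merged: the function names `iceberg_schema_metadata` and `read_iceberg_schema_json` are hypothetical, and the values of `UTF8` and `ICEBERG_SCHEMA` are assumptions inferred from the diff (which shows the names but not their definitions).

```python
from typing import Dict

# Assumed values: the diff references ICEBERG_SCHEMA and a utf8 constant,
# but does not show how they are defined.
UTF8 = "utf-8"
ICEBERG_SCHEMA = b"iceberg.schema"


def iceberg_schema_metadata(schema_json: str) -> Dict[bytes, bytes]:
    """Hypothetical helper: wrap a schema's JSON in bytes-keyed metadata."""
    return {ICEBERG_SCHEMA: schema_json.encode(UTF8)}


def read_iceberg_schema_json(metadata: Dict[bytes, bytes]) -> str:
    """Hypothetical inverse: decode the schema JSON back out of the metadata."""
    return metadata[ICEBERG_SCHEMA].decode(UTF8)
```

With something like this, the fixture would pass `metadata=iceberg_schema_metadata(schema_int.model_dump_json())` and callers would never touch the key or the encoding directly.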

-def schema_to_pyarrow(schema: Union[Schema, IcebergType]) -> pa.schema:
-    return visit(schema, _ConvertToArrowSchema())
+def schema_to_pyarrow(schema: Union[Schema, IcebergType], metadata: Dict[bytes, bytes] = EMPTY_DICT) -> pa.schema:
+    return visit(schema, _ConvertToArrowSchema(metadata))
Contributor

What is the visit() behavior with an empty dict?

Contributor Author

Sometimes we use the visitor only to convert types. In that case we don't need to set any metadata, so defaulting to an empty dict makes things easier and less verbose.
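
A toy sketch of that pattern, using only the standard library: metadata is attached when a whole schema is converted and simply stays empty when a single type is converted. `ToyConverter` and `convert` are hypothetical stand-ins for `_ConvertToArrowSchema` and `schema_to_pyarrow`, and the read-only `EMPTY_DICT` here is an assumption about how such a default can be made safe to share, not the project's actual definition.

```python
from types import MappingProxyType
from typing import Any, List, Mapping, Union

# Read-only empty mapping: safe as a shared default argument, unlike a mutable {}.
EMPTY_DICT: Mapping[bytes, bytes] = MappingProxyType({})


class ToyConverter:
    """Hypothetical visitor: metadata is only used at the schema level."""

    def __init__(self, metadata: Mapping[bytes, bytes] = EMPTY_DICT) -> None:
        self.metadata = metadata

    def schema(self, fields: List[str]) -> dict:
        # A whole schema carries the metadata along with its converted fields.
        return {"fields": fields, "metadata": dict(self.metadata)}

    def primitive(self, name: str) -> str:
        # A bare type conversion never looks at the metadata.
        return f"arrow<{name}>"


def convert(node: Union[str, List[str]], metadata: Mapping[bytes, bytes] = EMPTY_DICT) -> Any:
    """Convert a list of type names (a 'schema') or a single type name."""
    converter = ToyConverter(metadata)
    if isinstance(node, list):
        return converter.schema([converter.primitive(n) for n in node])
    return converter.primitive(node)
```

Calling `convert("int")` needs no metadata argument at all, while `convert(["int", "string"], metadata={b"k": b"v"})` attaches it to the resulting schema.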

@bitsondatadev bitsondatadev left a comment

LGTM!

@Fokko Fokko merged commit 9189cb3 into apache:main Nov 3, 2023
@Fokko Fokko deleted the fd-refactor-schame-convertion branch November 3, 2023 15:52

Fokko commented Nov 3, 2023

Thanks @bitsondatadev and @amogh-jahagirdar for the review 🥳

3 participants