
Use write.parquet.compression-{codec,level} #358

Merged: 9 commits, Feb 5, 2024

Conversation

jonashaag (Contributor) commented Feb 3, 2024

I had to change the metadata_collector code due to dask/dask#7977.

For the <not set> case (no compression specified) the tests currently pass locally but they shouldn't as we never set zstd as the default. Not sure what's going on?

Should we default to gzip which is the default for Iceberg according to this document, or should we use a more reasonable and modern default like zstd level 3?

compression = parquet_metadata.row_group(0).column(0).compression

if compression_codec == "<not set>":
    assert compression == "ZSTD"
Collaborator:
Yeah, just as you noted in the PR description, this issue is interesting @jonashaag! I'm not sure at first glance how this is being set to ZSTD by default.

Since ParquetWriter's default compression is 'snappy' and Iceberg's is 'gzip', I think we will have to set the default value when parsing the properties regardless, just as you suggested.
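A minimal sketch of that fallback, assuming a hypothetical `resolve_compression` helper (the property key is the one this PR introduces; the default follows Iceberg's zstd default):

```python
# Hypothetical helper: resolve the Parquet compression codec from Iceberg
# table properties, falling back to zstd when the key is absent.
DEFAULT_COMPRESSION_CODEC = "zstd"

def resolve_compression(table_properties: dict) -> str:
    return table_properties.get(
        "write.parquet.compression-codec", DEFAULT_COMPRESSION_CODEC
    )

print(resolve_compression({}))  # zstd
print(resolve_compression({"write.parquet.compression-codec": "gzip"}))  # gzip
```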

Contributor:

This is a bug in the documentation :( See apache/iceberg-docs#305. Since Java 1.4 it has changed to zstd. But that fix in the documentation has not yet been released (will be part of 1.5).

sungwy (Collaborator) commented Feb 3, 2024

@jonashaag thank you for raising the issue and putting this PR together so quickly! We are very excited to group this fix in with the impending 0.6.0 release. I've left some comments on the PR - let's also wait for reviews from @Fokko and others, and hopefully get this merged in soon 😄

jonashaag (Contributor, Author):

Can you start CI @syun64?

amogh-jahagirdar (Contributor):

> For the <not set> case (no compression specified) the tests currently pass locally but they shouldn't as we never set zstd as the default

The default Parquet compression is ZSTD, at least for integration test purposes: the integration tests use a Java-based REST catalog, and when the metadata.json file gets written, it persists ZSTD by default for Parquet files (apache/iceberg@2e291c2).

sungwy (Collaborator) left a comment:

Suggestions in line with the review comment: use write.parquet.compression-codec as a table property instead of a catalog property.

jonashaag (Contributor, Author) commented Feb 4, 2024

I've changed the properties to be table properties and added handling for some other Parquet properties.
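A sketch of how such property handling might look; the helper name and the exact set of handled keys are illustrative, not the PR's actual code:

```python
# Hypothetical translation of Iceberg table properties into
# pyarrow.parquet.ParquetWriter keyword arguments.
def parquet_writer_kwargs(table_properties: dict) -> dict:
    kwargs = {}
    codec = table_properties.get("write.parquet.compression-codec")
    if codec is not None:
        kwargs["compression"] = codec
    level = table_properties.get("write.parquet.compression-level")
    if level is not None:
        # ParquetWriter expects an int for compression_level
        kwargs["compression_level"] = int(level)
    return kwargs

props = {
    "write.parquet.compression-codec": "zstd",
    "write.parquet.compression-level": "3",
}
print(parquet_writer_kwargs(props))
```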

Fokko (Contributor) left a comment:

This looks great @jonashaag! Thanks for jumping on this right away. Once this is in, we can do another release candidate 👍

Could you add this to the docs? This is in mkdocs/docs/. I don't think we have a table for this today, but I think we should add it under the paragraph of the file-io: https://py.iceberg.apache.org/configuration/#fileio

else:
    return int(value)

for key_pattern in [
Contributor:

Do we want to blow up if one of the properties isn't set?

jonashaag (Contributor, Author):

We want to raise if one of the properties is set. But I guess we should check for None

jonashaag (Contributor, Author):

None is not allowed; reverted my changes.
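The guard under discussion, raising for recognized-but-unimplemented writer options, can be sketched like this (the function name and the example property key are illustrative):

```python
import fnmatch

def check_unsupported(table_properties: dict, patterns: list) -> None:
    # Raise if any property key matches a pattern for an option that is
    # recognized but not implemented; pass silently otherwise.
    for key_pattern in patterns:
        if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
            raise NotImplementedError(
                f"Parquet writer option(s) {unsupported_keys} not implemented"
            )

try:
    check_unsupported(
        {"write.parquet.bloom-filter-enabled.column.x": "true"},
        ["write.parquet.bloom-filter-*"],
    )
except NotImplementedError as exc:
    print(exc)
```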

jonashaag and others added 2 commits February 4, 2024 20:53
jonashaag (Contributor, Author):

Sorry I don't feel comfortable writing documentation because I still lack a lot of Iceberg understanding and terminology. Could you do that part please?

Fokko (Contributor) commented Feb 4, 2024

> Sorry I don't feel comfortable writing documentation because I still lack a lot of Iceberg understanding and terminology. Could you do that part please?

Sure thing, no problem at all 👍

> Should we default to gzip which is the default for Iceberg according to this document, or should we use a more reasonable and modern default like zstd level 3?

This is a bug in the documentation. See apache/iceberg-docs#305. Since Java 1.4 it has changed to zstd. But that fix in the documentation has not yet been released (will be part of 1.5).

jonashaag (Contributor, Author):

Sweet, ready to merge from my POV

Fokko (Contributor) left a comment:

Looks good. Thanks for picking this up on such short notice @jonashaag

fill_parquet_file_metadata(
    data_file=data_file,
-   parquet_metadata=collected_metrics[0],
+   parquet_metadata=writer.writer.metadata,
Contributor:

I checked this through the debugger, and this looks good. Nice change @jonashaag 👍

jonashaag (Contributor, Author):

You can also tell from the PyArrow code that it's identical :)

@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Feb 4, 2024
@HonahX HonahX self-requested a review February 4, 2024 22:16
HonahX (Contributor) left a comment:

Thanks for the great work, @jonashaag! Overall LGTM! Just one comment about the default value.


@pytest.mark.integration
@pytest.mark.integration
@pytest.mark.integration
Contributor:

Seems we have 2 extra @pytest.mark.integration



@pytest.mark.integration
@pytest.mark.integration

Contributor:

Suggested change (drop the duplicate decorator):
-@pytest.mark.integration

if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
    raise NotImplementedError(f"Parquet writer option(s) {unsupported_keys} not implemented")

compression_codec = table_properties.get("write.parquet.compression-codec")
Contributor:

Suggested change:
-compression_codec = table_properties.get("write.parquet.compression-codec")
+compression_codec = table_properties.get("write.parquet.compression-codec", "zstd")

How about adding the default value here? The RestCatalog backend and HiveCatalog explicitly set the default codec at the catalog level:

DEFAULT_PROPERTIES = {'write.parquet.compression-codec': 'zstd'}

But other catalogs, such as Glue and SQL, do not set this explicitly when creating new tables. In general, for tables that have no write.parquet.compression-codec key in their properties, we still want to use the default codec zstd when writing Parquet.

@Fokko Fokko merged commit 9e4ed29 into apache:main Feb 5, 2024
6 checks passed