Use write.parquet.compression-{codec,level} (#358)
Conversation
tests/integration/test_writes.py (outdated)
```python
compression = parquet_metadata.row_group(0).column(0).compression

if compression_codec == "<not set>":
    assert compression == "ZSTD"
```
Yeah, just as you noted in the PR description, this issue is interesting @jonashaag! I'm not sure at first glance how this is being set to ZSTD by default.
Since `ParquetWriter`'s compression default is `'snappy'` and Iceberg's is `'gzip'`, I think we will have to set the default value when parsing the properties regardless, just as you suggested.
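For reference, a quick standalone check of that pyarrow default (a minimal sketch, not code from this PR; the file path is arbitrary):

```python
# pyarrow falls back to snappy when no compression argument is given,
# which is why pyiceberg has to apply Iceberg's default itself.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
pq.write_table(table, "/tmp/default.parquet")  # no compression argument
print(pq.read_metadata("/tmp/default.parquet").row_group(0).column(0).compression)
# Prints "SNAPPY", pyarrow's own default
```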
This is a bug in the documentation :( See apache/iceberg-docs#305. Since Java 1.4 the default has changed to zstd, but that documentation fix has not yet been released (it will be part of 1.5).
@jonashaag thank you for raising the issue and putting this PR together so quickly! We are very excited to group this fix in with the impending 0.6.0 release. I've left some comments on the PR - let's also wait for reviews from @Fokko and others, and hopefully get this merged in soon 😄
Can you start CI @syun64?
The default parquet compression is ZSTD, at least for the integration tests, since they use a Java-based REST catalog: when the metadata.json file gets written, ZSTD is persisted by default for Parquet files (apache/iceberg@2e291c2).
Suggestions in line with the review comment to use `write.parquet.compression-codec` as a table property instead of a catalog property.
I've changed the properties to be table properties and added handling for some other Parquet properties.
This looks great @jonashaag, thanks for jumping on this right away. Once this is in we can do another release candidate 👍
Could you add this to the docs? This is in mkdocs/docs/. I don't think we have a table for this today, but I think we should add it under the file-io paragraph: https://py.iceberg.apache.org/configuration/#fileio
```python
else:
    return int(value)

for key_pattern in [
```
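For context, a hedged reconstruction of what this fragment could look like in full; the helper and function names and the property patterns are illustrative assumptions, not the PR's exact code:

```python
import fnmatch
from typing import Dict, Optional

def _get_int_property(table_properties: Dict[str, str], key: str) -> Optional[int]:
    # Hypothetical helper: parse an optional integer table property.
    value = table_properties.get(key)
    if value is None:
        return None
    else:
        return int(value)

def _check_unsupported(table_properties: Dict[str, str]) -> None:
    # Reject Parquet writer options that are not implemented yet
    # (the patterns below are made up for illustration).
    for key_pattern in [
        "write.parquet.row-group-size-bytes",
        "write.parquet.bloom-filter-enabled.column.*",
    ]:
        if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
            raise NotImplementedError(f"Parquet writer option(s) {unsupported_keys} not implemented")
```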
Do we want to blow up if one of the properties isn't set?
We want to raise if one of the properties is set. But I guess we should check for `None`.
`None` is not allowed; reverted my changes.
Co-authored-by: Fokko Driesprong <[email protected]>
Sorry, I don't feel comfortable writing documentation because I still lack a lot of Iceberg understanding and terminology. Could you do that part, please?
Sure thing, no problem at all 👍
> This is a bug in the documentation. See apache/iceberg-docs#305. Since Java 1.4 it has changed to zstd. But that fix in the documentation has not yet been released (will be part of 1.5).
Sweet, ready to merge from my POV.
Looks good. Thanks for picking this up on such short notice @jonashaag
```diff
  fill_parquet_file_metadata(
      data_file=data_file,
-     parquet_metadata=collected_metrics[0],
+     parquet_metadata=writer.writer.metadata,
```
I checked this through the debugger, and this looks good. Nice change @jonashaag 👍
You can also tell from the PyArrow code that it's identical :)
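A hedged sketch of the equivalence being discussed: once the writer is closed, the file metadata is available from pyarrow's underlying writer, so no separate metadata_collector is needed (attribute access as used in the PR; worth verifying against your pyarrow version):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
with pq.ParquetWriter("/tmp/example.parquet", table.schema) as writer:
    writer.write_table(table)

# writer.writer is the low-level ParquetWriter; its metadata matches what
# pq.read_metadata() reads back from the finished file.
metadata = writer.writer.metadata
assert metadata.num_rows == pq.read_metadata("/tmp/example.parquet").num_rows
```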
Thanks for the great work, @jonashaag! Overall LGTM. Just one comment about the default value.
tests/integration/test_writes.py (outdated)

```python
@pytest.mark.integration
@pytest.mark.integration
@pytest.mark.integration
```
Seems we have 2 extra `@pytest.mark.integration` decorators.
tests/integration/test_writes.py (outdated)

```python
@pytest.mark.integration
@pytest.mark.integration
```
```python
@pytest.mark.integration
```
pyiceberg/io/pyarrow.py (outdated)
```python
if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
    raise NotImplementedError(f"Parquet writer option(s) {unsupported_keys} not implemented")

compression_codec = table_properties.get("write.parquet.compression-codec")
```
Suggested change:

```diff
- compression_codec = table_properties.get("write.parquet.compression-codec")
+ compression_codec = table_properties.get("write.parquet.compression-codec", "zstd")
```
How about adding the default value here? The RestCatalog backend and HiveCatalog explicitly set the default codec at the catalog level:

iceberg-python/pyiceberg/catalog/hive.py, line 158 in 02e6430:

```python
DEFAULT_PROPERTIES = {'write.parquet.compression-codec': 'zstd'}
```

But other catalogs, such as glue and sql, do not set this explicitly when creating new tables. In general, for tables that have no `write.parquet.compression-codec` key in their properties, we still want to use the default codec zstd when writing parquet.
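A small sketch of the proposed fallback behavior (a hedged illustration, not the final code; the function name is made up):

```python
# Catalog-level default, as seeded by HiveCatalog/RestCatalog (see hive.py above):
DEFAULT_PROPERTIES = {"write.parquet.compression-codec": "zstd"}

def effective_codec(table_properties: dict) -> str:
    # Writer-side fallback so tables created by catalogs that do not seed
    # the property (e.g. glue, sql) still compress with zstd.
    return table_properties.get("write.parquet.compression-codec", "zstd")

assert effective_codec({}) == "zstd"
assert effective_codec({"write.parquet.compression-codec": "gzip"}) == "gzip"
```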
I had to change the `metadata_collector` code due to dask/dask#7977.

For the `<not set>` case (no compression specified), the tests currently pass locally, but they shouldn't, as we never set zstd as the default. Not sure what's going on?

Should we default to gzip, which is the default for Iceberg according to this document, or should we use a more reasonable and modern default like zstd level 3?
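To make that last question concrete, here is a hedged sketch of how the two table properties could map onto pyarrow's writer arguments (the parsing and file path are illustrative assumptions; `pq.ParquetWriter` does accept `compression` and `compression_level` keywords):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table_properties = {
    "write.parquet.compression-codec": "zstd",
    "write.parquet.compression-level": "3",
}

# Fall back to zstd when the codec property is absent (the suggested default).
codec = table_properties.get("write.parquet.compression-codec", "zstd")
level = table_properties.get("write.parquet.compression-level")

table = pa.table({"x": [1, 2, 3]})
with pq.ParquetWriter(
    "/tmp/example.parquet",
    table.schema,
    compression=codec,
    compression_level=int(level) if level is not None else None,
) as writer:
    writer.write_table(table)
```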