
Use write.parquet.compression-{codec,level} #358

Merged: 9 commits, Feb 5, 2024

Conversation

jonashaag (Contributor) commented Feb 3, 2024

I had to change the metadata_collector code due to dask/dask#7977.

For the <not set> case (no compression specified) the tests currently pass locally but they shouldn't as we never set zstd as the default. Not sure what's going on?

Should we default to gzip which is the default for Iceberg according to this document, or should we use a more reasonable and modern default like zstd level 3?

compression = parquet_metadata.row_group(0).column(0).compression

if compression_codec == "<not set>":
    assert compression == "ZSTD"
Collaborator:
Yeah, just as you noted in the PR description, this issue is interesting @jonashaag! I'm not sure at first glance how this is being set to ZSTD by default.

Since ParquetWriter's default compression is 'snappy' and Iceberg's is 'gzip', I think we will have to set the default value when parsing the properties regardless, just as you suggested.
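A minimal sketch of that fallback, assuming a hypothetical `resolve_compression` helper (the property key is the one this PR introduces; the default follows Iceberg's zstd default):

```python
# Hypothetical helper: resolve the Parquet compression codec from Iceberg
# table properties, falling back to zstd when the key is absent.
DEFAULT_COMPRESSION_CODEC = "zstd"

def resolve_compression(table_properties: dict) -> str:
    return table_properties.get(
        "write.parquet.compression-codec", DEFAULT_COMPRESSION_CODEC
    )

print(resolve_compression({}))  # zstd
print(resolve_compression({"write.parquet.compression-codec": "gzip"}))  # gzip
```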

Contributor:

This is a bug in the documentation :( See apache/iceberg-docs#305. Since Java 1.4 it has changed to zstd. But that fix in the documentation has not yet been released (will be part of 1.5).

sungwy (Collaborator) commented Feb 3, 2024

@jonashaag thank you for raising the issue and putting this PR together so quickly! We are very excited to group this fix in with the impending 0.6.0 release. I've left some comments on the PR - let's also wait for reviews from @Fokko and others, and hopefully get this merged in soon 😄

jonashaag (Contributor, Author):

Can you start CI @syun64?

amogh-jahagirdar (Contributor):

> For the <not set> case (no compression specified) the tests currently pass locally but they shouldn't as we never set zstd as the default

The default Parquet compression is ZSTD, at least for integration test purposes: the integration tests use a Java-based REST catalog, and when the metadata.json file gets written, it persists ZSTD by default for Parquet files (apache/iceberg@2e291c2).

sungwy (Collaborator) left a comment:

Suggestions in line with the review comment: use write.parquet.compression-codec as a table property instead of a catalog property.

jonashaag (Contributor, Author) commented Feb 4, 2024

I've changed the properties to be table properties and added handling for some other Parquet properties.
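A sketch of how such property handling might look; the helper name and the exact set of handled keys are illustrative, not the PR's actual code:

```python
# Hypothetical translation of Iceberg table properties into
# pyarrow.parquet.ParquetWriter keyword arguments.
def parquet_writer_kwargs(table_properties: dict) -> dict:
    kwargs = {}
    codec = table_properties.get("write.parquet.compression-codec")
    if codec is not None:
        kwargs["compression"] = codec
    level = table_properties.get("write.parquet.compression-level")
    if level is not None:
        # ParquetWriter expects an int for compression_level
        kwargs["compression_level"] = int(level)
    return kwargs

props = {
    "write.parquet.compression-codec": "zstd",
    "write.parquet.compression-level": "3",
}
print(parquet_writer_kwargs(props))
```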

Fokko (Contributor) left a comment:

This looks great @jonashaag! Thanks for jumping on this right away. Once this is in, we can do another release candidate 👍

Could you add this to the docs? This is in mkdocs/docs/. I don't think we have a table for this today, but I think we should add it under the paragraph of the file-io: https://py.iceberg.apache.org/configuration/#fileio

else:
    return int(value)

for key_pattern in [
Contributor:

Do we want to blow up if one of the properties isn't set?

jonashaag (Contributor, Author):

We want to raise if one of the properties is set. But I guess we should check for None

jonashaag (Contributor, Author):

None is not allowed; reverted my changes.
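The guard under discussion, raising for recognized-but-unimplemented writer options, can be sketched like this (the function name and the example property key are illustrative):

```python
import fnmatch

def check_unsupported(table_properties: dict, patterns: list) -> None:
    # Raise if any property key matches a pattern for an option that is
    # recognized but not implemented; pass silently otherwise.
    for key_pattern in patterns:
        if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
            raise NotImplementedError(
                f"Parquet writer option(s) {unsupported_keys} not implemented"
            )

try:
    check_unsupported(
        {"write.parquet.bloom-filter-enabled.column.x": "true"},
        ["write.parquet.bloom-filter-*"],
    )
except NotImplementedError as exc:
    print(exc)
```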

jonashaag and others added 2 commits February 4, 2024 20:53
jonashaag (Contributor, Author):

Sorry I don't feel comfortable writing documentation because I still lack a lot of Iceberg understanding and terminology. Could you do that part please?

Fokko (Contributor) commented Feb 4, 2024

> Sorry I don't feel comfortable writing documentation because I still lack a lot of Iceberg understanding and terminology. Could you do that part please?

Sure thing, no problem at all 👍

> Should we default to gzip which is the default for Iceberg according to this document, or should we use a more reasonable and modern default like zstd level 3?

This is a bug in the documentation. See apache/iceberg-docs#305. Since Java 1.4 it has changed to zstd. But that fix in the documentation has not yet been released (will be part of 1.5).

jonashaag (Contributor, Author):

Sweet, ready to merge from my POV

Fokko (Contributor) left a comment:

Looks good. Thanks for picking this up on such short notice @jonashaag

fill_parquet_file_metadata(
    data_file=data_file,
-   parquet_metadata=collected_metrics[0],
+   parquet_metadata=writer.writer.metadata,
Contributor:

I checked this through the debugger, and this looks good. Nice change @jonashaag 👍

jonashaag (Contributor, Author):

You can also tell from the PyArrow code that it's identical :)

@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Feb 4, 2024
@HonahX HonahX self-requested a review February 4, 2024 22:16
HonahX (Contributor) left a comment:

Thanks for the great work, @jonashaag! Overall LGTM! Just one comment about the default value.


@pytest.mark.integration
@pytest.mark.integration
@pytest.mark.integration
Contributor:

Seems we have 2 extra @pytest.mark.integration



@pytest.mark.integration
@pytest.mark.integration

Contributor:

Suggested change (drop the duplicate decorator):
-@pytest.mark.integration

if unsupported_keys := fnmatch.filter(table_properties, key_pattern):
    raise NotImplementedError(f"Parquet writer option(s) {unsupported_keys} not implemented")

compression_codec = table_properties.get("write.parquet.compression-codec")
Contributor:

Suggested change:
-compression_codec = table_properties.get("write.parquet.compression-codec")
+compression_codec = table_properties.get("write.parquet.compression-codec", "zstd")

How about adding the default value here? The RestCatalog backend and HiveCatalog explicitly set the default codec at the catalog level:

DEFAULT_PROPERTIES = {'write.parquet.compression-codec': 'zstd'}

But other catalogs, such as Glue and SQL, do not set this explicitly when creating new tables. In general, for tables that have no write.parquet.compression-codec key in their properties, we still want to use the default codec zstd when writing Parquet.

@Fokko Fokko merged commit 9e4ed29 into apache:main Feb 5, 2024
6 checks passed