
Arrow: Avoid buffer overflow by avoiding a sort #1539

Closed

Conversation

@Fokko Fokko (Contributor) commented Jan 20, 2025

This was already being discussed back here: #208 (comment)

This PR changes the approach from doing a sort followed by a single pass over the table, to determining the unique partition tuples and filtering on them individually.

Fixes #1491

The sort caused buffers to be joined, which overflows in Arrow. I think this is really an issue on the Arrow side: it should automatically break the data up into smaller buffers. The combine_chunks method does this correctly.
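
A minimal sketch of the idea in plain PyArrow (illustrative names, not the actual pyiceberg code): compute the distinct partition tuples with group_by, then filter the table once per tuple, so the table never has to be sorted into one oversized contiguous buffer.

```python
import pyarrow as pa
import pyarrow.compute as pc

def split_by_partitions(table: pa.Table, partition_cols: list[str]) -> list[pa.Table]:
    # Distinct partition tuples, found without sorting the full table.
    unique_keys = table.group_by(partition_cols).aggregate([])
    parts = []
    for row in unique_keys.to_pylist():
        # Build one boolean mask per partition tuple and filter on it.
        mask = None
        for col, value in row.items():
            predicate = pc.equal(table[col], value)
            mask = predicate if mask is None else pc.and_(mask, predicate)
        parts.append(table.filter(mask))
    return parts

tbl = pa.table({"day": [1, 1, 2, 2, 2], "v": [10, 20, 30, 40, 50]})
for part in split_by_partitions(tbl, ["day"]):
    print(part["day"][0].as_py(), part.num_rows)  # 1 -> 2 rows, 2 -> 3 rows
```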

Now:

0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds

Before:

Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds

So it comes with a nice speedup as well :)
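
For reference, a hedged sketch of the kind of timing loop that produces output in this shape (the real benchmark lives in tests/benchmark/test_benchmark.py; write_partitioned here is a hypothetical stand-in for the operation being measured):

```python
import statistics
import time

def bench(write_partitioned, runs: int = 5) -> None:
    # Time the operation several times and report the mean, matching the output above.
    timings = []
    for i in range(runs):
        start = time.perf_counter()
        write_partitioned()  # hypothetical callable under test
        elapsed = time.perf_counter() - start
        timings.append(elapsed)
        print(f"Run {i} took: {elapsed}")
    print(f"Average runtime of {statistics.mean(timings):.2f} seconds")
```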

@kevinjqliu kevinjqliu (Contributor) left a comment

LGTM, I left a few comments.

pyiceberg/partitioning.py (outdated review thread, resolved)
# `ta` (column "a") is defined earlier in the test and is not shown in this snippet
y = ["fixed_string"] * 30_000
tb = pa.chunked_array([y] * 10_000)
# Create pa.table
arrow_table = pa.table({"a": ta, "b": tb})
It wasn't obvious to me that this test's offsets go beyond 32 bits, but I ran it and 4800280000 is greater than 2^32 (4294967296):

>>> len(arrow_table)
300000000
>>> arrow_table.get_total_buffer_size()
4800280000
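
A rough check of that number (an assumption on my part: the string column "b" accounts for essentially all of the total): each "fixed_string" value needs 12 bytes of character data plus a 4-byte int32 offset, and the table has 300 million rows, so the concatenated buffers already exceed what 32-bit offsets can address.

```python
rows = 30_000 * 10_000                   # 300,000,000 rows in column "b"
bytes_per_row = len("fixed_string") + 4  # 12 data bytes + one 4-byte int32 offset
print(rows * bytes_per_row)              # 4_800_000_000
print(2**32)                             # 4_294_967_296 -> the offsets overflow
```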

tests/benchmark/test_benchmark.py (outdated review thread, resolved)
pyiceberg/io/pyarrow.py (outdated review thread, resolved)
pyiceberg/io/pyarrow.py (outdated review thread, resolved)
@Fokko Fokko force-pushed the fd-fix-overflowing-buffer branch from 04a8218 to 3841fe7 on January 20, 2025 at 19:15
Comment on lines 418 to 419
# When adding files, it can be that we still need to convert from logical types to physical types
iceberg_typed_value = _to_partition_representation(iceberg_type, value)

Is this due to the fact that we already transform the partition key value

partition.transform.pyarrow_transform(source_field.field_type)(arrow_table[source_field.name])

and this expects the untransformed value?

If that's the case, can we just omit the transformation before the group_by?

@Fokko Fokko (Contributor, Author) replied:

Ah, of course. We want to know the output tuples after the transform, so omitting the transformation is not possible. I think we could do a follow-up PR where we split out the logic for the write path and the add-files path. After this PR, this conversion is no longer needed when doing partitioned writes; we only need it to preprocess values when importing partitions via add-files.
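
A small illustration of why the grouping has to happen on the transformed values, using plain PyArrow as a stand-in for the partition.transform.pyarrow_transform(...) call quoted above (illustrative only, not pyiceberg's code): many source values collapse into the same partition value only after the transform.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Three timestamps, two distinct days: the day "transform" (approximated here with
# floor_temporal) is what produces the partition tuples we need to group on.
ts = pa.array(
    ["2025-01-20T08:00:00", "2025-01-20T17:30:00", "2025-01-21T09:15:00"],
    type=pa.timestamp("us"),
)
days = pc.floor_temporal(ts, unit="day")
print(pc.unique(ts))    # 3 distinct raw values
print(pc.unique(days))  # 2 distinct partition values
```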

@Fokko Fokko closed this Jan 21, 2025
@Fokko Fokko force-pushed the fd-fix-overflowing-buffer branch from 3841fe7 to c84dd8d on January 21, 2025 at 14:02
@Fokko Fokko (Contributor, Author) commented Jan 21, 2025

Ugh, accidentally pushed main 🤦

@bigluck bigluck (Contributor) commented Jan 21, 2025

:'(

Fokko added a commit that referenced this pull request Jan 23, 2025
Second attempt of #1539


Co-authored-by: Kevin Liu <[email protected]>

Successfully merging this pull request may close these issues.

[Bug] Error in overwrite(): pyarrow.lib.ArrowInvalid: offset overflow with large dataset (~3M rows)