Core: Fix divide by zero when adjust split size #12201

Open · wants to merge 1 commit into base: main

Conversation

hantangwangd
Contributor

Currently, TableScanUtil.adjustSplitSize(...) handles the case where the splitSize parameter is negative, but it doesn't consider the case where splitSize equals 0. That is, in a compute engine like Spark, if we query a table after setting its property read.split.target-size to a negative value like -1, the query succeeds and returns results, since read.split.adaptive-size.enabled defaults to true. However, if we query a table after setting read.split.target-size to 0, the query fails with the error message / by zero.

This PR fixes the handling of a zero splitSize, ensuring that both negative and zero values are adjusted to an appropriate split size.
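For context, a rough repro sketch via the Iceberg Java API rather than Spark SQL (the catalog and table identifier here are placeholders, not part of this PR):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableProperties;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    // Hypothetical repro (not part of this PR): setting the split target size to 0
    // makes any subsequent scan that goes through adjustSplitSize(...) fail with
    // "/ by zero", while a negative value such as -1 is adjusted and succeeds.
    static void setZeroSplitSize(Catalog catalog, TableIdentifier ident) {
      Table table = catalog.loadTable(ident);
      table.updateProperties()
          .set(TableProperties.SPLIT_SIZE, "0") // "read.split.target-size"
          .commit();
    }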

long splitCount = LongMath.divide(scanSize, splitSize, RoundingMode.CEILING);
long adjustedSplitSize = Math.max(scanSize / parallelism, Math.min(MIN_SPLIT_SIZE, splitSize));
adjustedSplitSize = Math.max(adjustedSplitSize, Math.min(MIN_SPLIT_SIZE, splitSize));
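For illustration only, a minimal sketch of one way a guard could avoid the division by zero, reusing the names from the snippet above; the MIN_SPLIT_SIZE value and the surrounding control flow are assumed, not copied from the PR diff:

    import java.math.RoundingMode;
    import com.google.common.math.LongMath;

    public class SplitSizeSketch {
      // assumed minimum split size, in bytes (not taken from the PR)
      private static final long MIN_SPLIT_SIZE = 16 * 1024 * 1024;

      // Sketch only: clamp non-positive split sizes before they are used as a
      // divisor, so negative and zero values both fall back to the
      // parallelism-based target with a reasonable minimum.
      static long adjustSplitSize(long scanSize, int parallelism, long splitSize) {
        if (splitSize <= 0) {
          return Math.max(scanSize / parallelism, MIN_SPLIT_SIZE);
        }
        long splitCount = LongMath.divide(scanSize, splitSize, RoundingMode.CEILING);
        long adjustedSplitSize =
            Math.max(scanSize / parallelism, Math.min(MIN_SPLIT_SIZE, splitSize));
        return splitCount < parallelism ? adjustedSplitSize : splitSize;
      }
    }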
Collaborator

better to put all logic to compute adjustedSplitSize in a single function.

Contributor Author

Thanks for your review. I have refactored it to make it clearer; please take a look when you have time, thanks.

@@ -236,6 +236,9 @@ public static long adjustSplitSize(long scanSize, int parallelism, long splitSiz
// use the configured split size if it produces at least one split per slot
// otherwise, adjust the split size to target parallelism with a reasonable minimum
// increasing the split size may cause expensive spills and is not done automatically
if (splitSize <= 0) {
Member

Why are we allowing this at all? Shouldn't we just throw an error on a split size less than or equal to 0?

Contributor Author

Thanks for the feedback @RussellSpitzer. I noticed that when adjusting the split size, negative values were adjusted to normal values, while a zero value threw a / by zero exception because it was used as a divisor. So I tried to unify their behavior, whether that means uniformly adjusting to a normal split size or uniformly throwing an error.

If you think it's more appropriate to throw an error uniformly, I can adjust it. Please let me know your thoughts, thanks a lot.

Contributor Author

FYI, I see that some other properties, like METADATA_PREVIOUS_VERSIONS_MAX, are adjusted to a normal value when they are less than or equal to 0:

    int maxSize =
          Math.max(
              1,
              PropertyUtil.propertyAsInt(
                  properties,
                  TableProperties.METADATA_PREVIOUS_VERSIONS_MAX,
                  TableProperties.METADATA_PREVIOUS_VERSIONS_MAX_DEFAULT));

Member

The general rule of thumb has been that we try not to throw errors when reading things out of metadata.json, to avoid bricking a table on a metadata.json written by a bad writer. That shouldn't be the case here, since the metadata.json will still parse correctly; the invalid parameter would just throw an error at runtime.

Member

We already have this check immediately below,

  private static void validatePlanningArguments(long splitSize, int lookback, long openFileCost) {
    Preconditions.checkArgument(splitSize > 0, "Split size must be > 0: %s", splitSize);
    Preconditions.checkArgument(lookback > 0, "Split planning lookback must be > 0: %s", lookback);
    Preconditions.checkArgument(openFileCost >= 0, "File open cost must be >= 0: %s", openFileCost);
  }

How is this being avoided?

Contributor Author
@hantangwangd · Feb 9, 2025

When querying in Spark, adjustSplitSize(...) is invoked before TableScanUtil.planTaskGroups(...), as shown below, so an illegal split size is handled in adjustSplitSize(...) first and hits this problem. I believe other engines that have logic to adjust the split size would behave the same way.

        TableScanUtil.planTaskGroups(
                CloseableIterable.withNoopClose(tasks()),
                adjustSplitSize(tasks(), scan.targetSplitSize()),
                scan.splitLookback(),
                scan.splitOpenFileCost());

If I understand you correctly, we should simply add a checkArgument for splitSize in adjustSplitSize(...) to throw an error for illegal values; is that right?
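For reference, that fail-fast alternative would presumably look roughly like this (a sketch only, mirroring the Preconditions style of validatePlanningArguments above; the rest of the method body is elided):

    import com.google.common.base.Preconditions;

    // Hypothetical fail-fast alternative: reject non-positive split sizes up
    // front, before splitSize is ever used as a divisor.
    public static long adjustSplitSize(long scanSize, int parallelism, long splitSize) {
      Preconditions.checkArgument(splitSize > 0, "Split size must be > 0: %s", splitSize);
      // ... existing adjustment logic unchanged ...
      return splitSize;
    }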
