Spark: Support singular form of years, months, days, and hours functions #12117

wypoon · 2025-01-27T23:39:05Z

Since Iceberg 1.4 (#8192), Spark has supported using the singular form (year, month, day, hour) of partition transforms to be consistent with the spec, while still supporting the plural form of those transforms for backward compatibility.
However, SparkFunctions still supported only the plural form (years, months, days, hours) for functions that can be used in queries.
For consistency, we add the singular form of those functions so the singular form can be used in queries as well.
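
For context, the Iceberg spec defines the year, month, day, and hour transforms as counts from the Unix epoch, which is why they are distinct from Spark's built-in functions of the same names. The following is a minimal sketch in plain Java; the class and method names are hypothetical and this is not the Iceberg implementation.

```java
import java.time.Instant;
import java.time.LocalDate;

// Hypothetical stand-alone sketch of the transform semantics; not Iceberg code.
final class TransformSketch {
  private TransformSketch() {}

  // year transform: whole years since 1970
  static int year(LocalDate date) {
    return date.getYear() - 1970;
  }

  // month transform: whole months since 1970-01
  static int month(LocalDate date) {
    return (date.getYear() - 1970) * 12 + (date.getMonthValue() - 1);
  }

  // day transform: days since 1970-01-01
  static int day(LocalDate date) {
    return Math.toIntExact(date.toEpochDay());
  }

  // hour transform: whole hours since 1970-01-01T00:00:00Z (floor division)
  static int hour(Instant instant) {
    return Math.toIntExact(Math.floorDiv(instant.getEpochSecond(), 3600L));
  }

  public static void main(String[] args) {
    LocalDate date = LocalDate.of(2017, 12, 1);
    System.out.println(year(date));  // 47
    System.out.println(month(date)); // 575
    System.out.println(day(date));   // 17501
    System.out.println(hour(Instant.parse("2017-12-01T10:12:55Z"))); // 420034
  }
}
```

Note that the Spark function surfaces the day value as a date rather than a raw ordinal, as the TestSparkDaysFunction assertion in this PR shows.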

ebyhr · 2025-01-28T01:26:38Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/functions/SparkFunctions.java

+          .put("years", new YearsFunction())
+          .put("year", new YearsFunction())
+          .put("months", new MonthsFunction())
+          .put("month", new MonthsFunction())
+          .put("days", new DaysFunction())
+          .put("day", new DaysFunction())
+          .put("hours", new HoursFunction())
+          .put("hour", new HoursFunction())


These function classes contain examples with plural styles in javadoc. Can we add another example or just update them?

That's a good point. Let me update the javadoc of those functions to add examples with the singular form.
The functions also have name() methods that return the plural form. I'm not sure if we should change the name though. For now, my thinking is to leave the name alone, but just to support using the singular form (and documenting the use in javadoc). If there is a doc page that should be updated, I'll be happy to update it if you let me know.

ebyhr · 2025-01-28T01:30:44Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkDaysFunction.java

@@ -39,6 +39,9 @@ public void testDates() {
    assertThat(scalarSql("SELECT system.days(date('2017-12-01'))"))
        .as("Expected to produce 2017-12-01")
        .isEqualTo(Date.valueOf("2017-12-01"));
+    assertThat(scalarSql("SELECT system.day(date('2017-12-01'))"))


Can we also update negative test cases? e.g. testWrongNumberOfArguments
The error message always uses plural names because function name is hard-coded in name() and canonicalName() methods in each function class. We may want to add a constructor taking the function name to report the correct function name.

I didn't repeat all the test cases, as it does not seem necessary. I just added a sample of each (positive) case with the singular form to demonstrate that the form can be used.

That hides the fact that the error message can report a function name different from the one the user specified. By the way, I'm not asking you to repeat all the test cases.

Ok, I see what you're getting at.
I decided to add constructors to the functions so that we know what form (singular or plural) is being used; plural is the default to preserve existing behavior.
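
The idea discussed in this thread can be sketched as follows: a function object that remembers which form it was registered under, so that name() and any error message report the form the user actually called. This is a simplified illustration in plain Java; the class name, the checkArgumentCount helper, and the message text are hypothetical and are not taken from the PR.

```java
// Hypothetical sketch; not the DaysFunction class from the PR.
final class DaysFunctionSketch {
  private final boolean singular;

  // Plural is the default, which preserves the pre-existing behavior.
  DaysFunctionSketch() {
    this(false);
  }

  // A boolean keeps the name restricted to exactly two values.
  DaysFunctionSketch(boolean singular) {
    this.singular = singular;
  }

  String name() {
    return singular ? "day" : "days";
  }

  // Stand-in for argument validation: the error names the form in use.
  void checkArgumentCount(int count) {
    if (count != 1) {
      throw new UnsupportedOperationException(
          "Wrong number of arguments for function " + name());
    }
  }

  public static void main(String[] args) {
    System.out.println(new DaysFunctionSketch().name());     // days
    System.out.println(new DaysFunctionSketch(true).name()); // day
  }
}
```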

ebyhr · 2025-01-28T01:32:11Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/functions/SparkFunctions.java

+          .put("years", new YearsFunction())
+          .put("year", new YearsFunction())


We could extract a constant instead of initializing two instances.
(This comment might conflict with my other comment in TestSparkDaysFunction.)

Also, use constants for the functions in SparkFunctions.
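
To illustrate the "one constant, two names" suggestion and why it may conflict with reporting the user's chosen name, here is a small sketch. YearsFn and the map are hypothetical stand-ins, not the SparkFunctions code.

```java
import java.util.Map;

// Hypothetical sketch of registering one shared instance under two names.
final class SharedConstantSketch {
  private SharedConstantSketch() {}

  static final class YearsFn {
    String name() {
      return "years";
    }
  }

  private static final YearsFn YEARS_FUNCTION = new YearsFn();

  // Both spellings resolve to the same instance.
  static final Map<String, YearsFn> FUNCTIONS =
      Map.of("years", YEARS_FUNCTION, "year", YEARS_FUNCTION);

  public static void main(String[] args) {
    // The shared instance cannot know which spelling was used.
    System.out.println(FUNCTIONS.get("year").name()); // years
  }
}
```

Because both keys share one instance, a lookup by "year" still reports "years", which is the tension with the name-reporting comment on TestSparkDaysFunction.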

wypoon · 2025-01-28T02:54:13Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/functions/DaysFunction.java

@@ -31,6 +31,8 @@
 * A Spark function implementation for the Iceberg day transform.
 *
 * <p>Example usage: {@code SELECT system.days('source_col')}.
+ *
+ * <p>Alternate form: {@code SELECT system.day('source_col')}.
 */
 public class DaysFunction extends UnaryUnboundFunction {


@ebyhr if I understand you correctly, you also suggest introducing a constructor that takes a String name and then have name() return this name, right?
So new DaysFunction("day") would return "day" when its name() is called, while new DaysFunction("days") would return "days".
I think that introduces more complexity than it's worth.

We don't want the function to be constructed with an arbitrary name. We really just want the name to be either "day" or "days" in this case. So we can have a constructor with a boolean.

wypoon · 2025-01-28T19:38:28Z

@Fokko @nastra would you mind reviewing this?

wypoon · 2025-01-31T03:09:16Z

@ebyhr thank you for your reviews.
@szehon-ho @amogh-jahagirdar would you mind taking a look? It would be nice to get this into 1.8.0.

wypoon · 2025-01-31T05:57:59Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/functions/SparkFunctions.java

-          YearsFunction.class, new YearsFunction(),
-          MonthsFunction.class, new MonthsFunction(),
-          DaysFunction.class, new DaysFunction(),
-          HoursFunction.class, new HoursFunction(),
-          BucketFunction.class, new BucketFunction(),
-          TruncateFunction.class, new TruncateFunction());
+          YearsFunction.class, YEARS_FUNCTION,
+          MonthsFunction.class, MONTHS_FUNCTION,
+          DaysFunction.class, DAYS_FUNCTION,
+          HoursFunction.class, HOURS_FUNCTION,
+          BucketFunction.class, BUCKET_FUNCTION,
+          TruncateFunction.class, TRUNCATE_FUNCTION);


I had to make a choice here. CLASS_TO_FUNCTIONS is used by loadFunctionsByClass(Class) and that is called by the ReplaceStaticInvoke rule. The StaticInvoke contains a class and that is all we have to go by; we do not know if the class belongs to an instance with singular equal to true or false. So I map the classes to function instances of the plural form (to preserve existing behavior).
I didn't feel it worthwhile to have two classes instead of one to distinguish between singular and plural forms.
The tradeoff is that if we inspect the internals of a Spark LogicalPlan, look at an ApplyFunctionExpression corresponding to, e.g., applying YearsFunction.TimestampToYearsFunction, and ask it for its name, we will get "years" (which comes from YearsFunction.TimestampToYearsFunction::name()) regardless of whether we called year or years in our SQL query. (This happens in the verification part of some tests.)

Without the changes in SparkV2Filters, changing CLASS_TO_FUNCTIONS to use the singular forms would cause test failures in TestSystemFunctionPushDownInRowLevelOperations and TestSystemFunctionPushDownDQL (even after accounting for the name when verifying the ApplyFunctionExpression).
However, with the changes in SparkV2Filters, we are free to choose the singular forms as standard if we wish.
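
The tradeoff described in this thread can be sketched in plain Java: a lookup keyed by class can return only one instance per class, so the form used in the query is lost. All names below (YearsFn, loadByClass, the two maps) are hypothetical simplifications of the structures mentioned above.

```java
import java.util.Map;

// Hypothetical sketch of name-based vs class-based lookup; not the PR's code.
final class ClassLookupSketch {
  private ClassLookupSketch() {}

  static final class YearsFn {
    private final boolean singular;

    YearsFn(boolean singular) {
      this.singular = singular;
    }

    String name() {
      return singular ? "year" : "years";
    }
  }

  static final YearsFn YEARS_FUNCTION = new YearsFn(false);
  static final YearsFn YEAR_FUNCTION = new YearsFn(true);

  // Lookup by name can distinguish the two forms.
  static final Map<String, YearsFn> FUNCTIONS =
      Map.of("years", YEARS_FUNCTION, "year", YEAR_FUNCTION);

  // Lookup by class has one slot per class, so one form must be chosen.
  static final Map<Class<?>, YearsFn> CLASS_TO_FUNCTIONS =
      Map.of(YearsFn.class, YEARS_FUNCTION);

  static YearsFn loadByClass(Class<?> clazz) {
    return CLASS_TO_FUNCTIONS.get(clazz);
  }

  public static void main(String[] args) {
    YearsFn called = FUNCTIONS.get("year");
    System.out.println(called.name());                         // year
    System.out.println(loadByClass(called.getClass()).name()); // years
  }
}
```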

wypoon · 2025-01-31T20:11:18Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkV2Filters.java

-      ImmutableSet.of("years", "months", "days", "hours", "bucket", "truncate");
+      ImmutableSet.of(
+          "year", "years", "month", "months", "day", "days", "hour", "hours", "bucket", "truncate");


I wasn't aware of the hardcoded references to "years", "months", "days" and "hours" in SparkV2Filters.
The changes here in SparkV2Filters are only necessary if we want to make the singular forms the standard.
Right now, I have left the plural forms as standard.
@rdblue should we make the singular form of the functions the standard?
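
As an illustration of why hardcoded names matter here, the sketch below checks a function name against the supported set and normalizes both spellings to one transform name. The set contents come from the diff above; the toTransformName helper and the choice to normalize to the singular spec names are assumptions made for illustration, not the SparkV2Filters code.

```java
import java.util.Set;

// Hypothetical sketch; not the SparkV2Filters implementation.
final class SupportedFunctionsSketch {
  private SupportedFunctionsSketch() {}

  static final Set<String> SUPPORTED_FUNCTIONS =
      Set.of(
          "year", "years", "month", "months", "day", "days", "hour", "hours", "bucket", "truncate");

  static boolean isSupported(String functionName) {
    return SUPPORTED_FUNCTIONS.contains(functionName);
  }

  // Assumed normalization: either spelling maps to the singular transform name.
  static String toTransformName(String functionName) {
    switch (functionName) {
      case "year":
      case "years":
        return "year";
      case "month":
      case "months":
        return "month";
      case "day":
      case "days":
        return "day";
      case "hour":
      case "hours":
        return "hour";
      case "bucket":
      case "truncate":
        return functionName;
      default:
        throw new IllegalArgumentException("Unsupported function: " + functionName);
    }
  }
}
```

With only the plural names in the set, a filter on a function named "year" would not be recognized for pushdown, which is the failure mode the change guards against.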

Hey @wypoon, thanks for working on this. Initially, we went for the plural form because the singular form is already a function in Spark (e.g. day). Do you know if these interfere? We should also make sure that we cover this in the tests.

@Fokko thanks for looking at this, and the explanation for why the plural form was used.
There is no interference between the Iceberg functions and the Spark built-in functions, because the Iceberg functions use the "system" namespace. I added tests for the built-in functions to demonstrate this.
There is one thing that gave me pause, which is the comment in BaseCatalog::isFunctionNamespace:

// Allow for empty namespace, as Spark's storage partitioned joins look up
// the corresponding functions to generate transforms for partitioning
// with an empty namespace, such as `bucket`.
// Otherwise, use `system` namespace.

For this reason, I added some variants to TestStoragePartitionJoins to use the singular forms of the transforms. Those tests pass too.
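
The namespace rule in the quoted comment can be sketched as below. The method bodies are an assumption modeled on that comment (empty namespace or `system`), written as a stand-alone class with a hypothetical name rather than copied from BaseCatalog.

```java
// Hypothetical sketch of the namespace check described in the quoted comment.
final class NamespaceSketch {
  private NamespaceSketch() {}

  static boolean isSystemNamespace(String[] namespace) {
    return namespace.length == 1 && namespace[0].equalsIgnoreCase("system");
  }

  // Empty namespace is allowed because storage-partitioned joins look up
  // transform functions such as `bucket` without a namespace.
  static boolean isFunctionNamespace(String[] namespace) {
    return namespace.length == 0 || isSystemNamespace(namespace);
  }

  public static void main(String[] args) {
    System.out.println(isFunctionNamespace(new String[] {"system"})); // true
    System.out.println(isFunctionNamespace(new String[] {}));         // true
    System.out.println(isFunctionNamespace(new String[] {"db"}));     // false
  }
}
```

Under this rule a lookup of day with an empty namespace could reach the Iceberg function, which is why the storage-partitioned join tests with singular transforms are a useful check.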

wypoon · 2025-02-04T00:07:50Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkHoursFunction.java

+    String tz = TimeZone.getDefault().getID();
+    assertThat(scalarSql(String.format("SELECT hour(TIMESTAMP '2017-12-01 10:12:55 %s')", tz)))
+        .as("Expected to produce 10")
+        .isEqualTo(10);


Note: This is how Spark SQL's TIMESTAMP_LTZ support works. The instant in time is converted to the local time zone and the hour function is applied to that. Thus, if the instant is already specified in the local time zone, the conversion leaves the hour unchanged, so we know what it is.
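
The reasoning in this note can be checked with java.time alone. The sketch below is not Spark code; it only mirrors the logic that a wall-clock time interpreted in a zone and then viewed in that same zone keeps its hour, while viewing it in a different zone generally does not. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical sketch of the time zone round trip; not Spark's implementation.
final class LocalHourSketch {
  private LocalHourSketch() {}

  // Hour of day of an instant as seen in the given zone.
  static int hourInZone(Instant instant, ZoneId zone) {
    return instant.atZone(zone).getHour();
  }

  public static void main(String[] args) {
    ZoneId tokyo = ZoneId.of("Asia/Tokyo");
    // Interpret the wall-clock time 10:12:55 in the Tokyo zone.
    Instant instant = LocalDateTime.of(2017, 12, 1, 10, 12, 55).atZone(tokyo).toInstant();

    System.out.println(hourInZone(instant, tokyo));            // 10
    System.out.println(hourInZone(instant, ZoneId.of("UTC"))); // 1
  }
}
```

The test in the diff relies on the first case by formatting the default time zone ID into the timestamp literal.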

github-actions bot added the spark label · Jan 27, 2025

Commits

87c408b · Spark: Support singular form of years, months, days, and hours functions
ccf99c7 · Add alternate forms to javadoc of functions. Also, use constants for the functions in SparkFunctions.
d1e6c4b · Add constructors to construct the functions with singular and plural forms. Singular is the default.
7f6ed48 · Make plural form the standard. Add tests for using the functions in queries.
9ed424e · Allow singular forms to become standard, if we so choose.
2ea1c19 · More tests.
519815a · Fix tests that were not bullet-proof.
