Spark: Support singular form of years, months, days, and hours functions #12117

Open
wants to merge 7 commits into
base: main

Conversation

wypoon
Contributor

@wypoon wypoon commented Jan 27, 2025

Since Iceberg 1.4 (#8192), Spark has supported using the singular form (year, month, day, hour) of partition transforms to be consistent with the spec, while still supporting the plural form of those transforms for backward compatibility.
However, SparkFunctions still supported only the plural form (years, months, days, hours) for functions that can be used in queries.
For consistency, we add the singular form of those functions so the singular form can be used in queries as well.

@github-actions github-actions bot added the spark label Jan 27, 2025
Comment on lines 35 to 42
.put("years", new YearsFunction())
.put("year", new YearsFunction())
.put("months", new MonthsFunction())
.put("month", new MonthsFunction())
.put("days", new DaysFunction())
.put("day", new DaysFunction())
.put("hours", new HoursFunction())
.put("hour", new HoursFunction())
Contributor

These function classes contain examples with plural styles in javadoc. Can we add another example or just update them?

Contributor Author

That's a good point. Let me update the javadoc of those functions to add examples with the singular form.
The functions also have name() methods that return the plural form. I'm not sure if we should change the name though. For now, my thinking is to leave the name alone, but just to support using the singular form (and documenting the use in javadoc). If there is a doc page that should be updated, I'll be happy to update it if you let me know.

@@ -39,6 +39,9 @@ public void testDates() {
assertThat(scalarSql("SELECT system.days(date('2017-12-01'))"))
.as("Expected to produce 2017-12-01")
.isEqualTo(Date.valueOf("2017-12-01"));
assertThat(scalarSql("SELECT system.day(date('2017-12-01'))"))
Contributor

@ebyhr ebyhr Jan 28, 2025

Can we also update negative test cases? e.g. testWrongNumberOfArguments
The error message always uses the plural names because the function name is hard-coded in the name() and canonicalName() methods of each function class. We may want to add a constructor that takes the function name so that the correct name is reported.

Contributor Author

I didn't repeat all the test cases, as it does not seem necessary. I just added a sample of each (positive) case with the singular form to demonstrate that the form can be used.

Contributor

@ebyhr ebyhr Jan 30, 2025

That hides the fact that these functions can report a function name in the error message that differs from the one the user specified. By the way, I'm not asking you to repeat all the test cases.

Contributor Author

@wypoon wypoon Jan 30, 2025

Ok, I see what you're getting at.
I decided to add constructors to the functions so that we know what form (singular or plural) is being used; plural is the default to preserve existing behavior.

Comment on lines 35 to 36
.put("years", new YearsFunction())
.put("year", new YearsFunction())
Contributor

We could extract a constant instead of initializing two instances.
(This comment might conflict with my other comment in TestSparkDaysFunction.)

Also, use constants for the functions in SparkFunctions.
@@ -31,6 +31,8 @@
* A Spark function implementation for the Iceberg day transform.
*
* <p>Example usage: {@code SELECT system.days('source_col')}.
*
* <p>Alternate form: {@code SELECT system.day('source_col')}.
*/
public class DaysFunction extends UnaryUnboundFunction {
Contributor Author

@wypoon wypoon Jan 28, 2025

@ebyhr if I understand you correctly, you also suggest introducing a constructor that takes a String name and then have name() return this name, right?
So new DaysFunction("day") would return "day" when its name() is called, while new DaysFunction("days") would return "days".
I think that introduces more complexity than it's worth.

Contributor Author

@wypoon wypoon Jan 30, 2025

We don't want the function to be constructed with an arbitrary name. We really just want the name to be either "day" or "days" in this case. So we can have a constructor with a boolean.
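A minimal sketch of the boolean-constructor idea, assuming a simplified stand-in class. DayTransformFunction, checkArgumentCount, and the error text are hypothetical and are not the actual DaysFunction code in this PR.

```java
// Hypothetical stand-in for a function class such as DaysFunction.
final class DayTransformFunction {
  private final boolean singular;

  // The no-arg constructor keeps the plural name to preserve existing behavior.
  DayTransformFunction() {
    this(false);
  }

  // Only two names are possible, so a boolean is enough; no arbitrary name can be passed in.
  DayTransformFunction(boolean singular) {
    this.singular = singular;
  }

  String name() {
    return singular ? "day" : "days";
  }

  // Illustrative check: the message is built from name(), so it reflects the form that was used.
  void checkArgumentCount(int numArgs) {
    if (numArgs != 1) {
      throw new UnsupportedOperationException(
          String.format("Wrong number of inputs for %s: expected 1, got %d", name(), numArgs));
    }
  }
}
```

With this shape, new DayTransformFunction(true).name() gives "day", while the default constructor still gives "days".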

@wypoon
Copy link
Contributor Author

wypoon commented Jan 28, 2025

@Fokko @nastra would you mind reviewing this?

Add tests for using the functions in queries.
@wypoon
Copy link
Contributor Author

wypoon commented Jan 31, 2025

@ebyhr thank you for your reviews.
@szehon-ho @amogh-jahagirdar would you mind taking a look? It would be nice to get this into 1.8.0.

Comment on lines -44 to +66
YearsFunction.class, new YearsFunction(),
MonthsFunction.class, new MonthsFunction(),
DaysFunction.class, new DaysFunction(),
HoursFunction.class, new HoursFunction(),
BucketFunction.class, new BucketFunction(),
TruncateFunction.class, new TruncateFunction());
YearsFunction.class, YEARS_FUNCTION,
MonthsFunction.class, MONTHS_FUNCTION,
DaysFunction.class, DAYS_FUNCTION,
HoursFunction.class, HOURS_FUNCTION,
BucketFunction.class, BUCKET_FUNCTION,
TruncateFunction.class, TRUNCATE_FUNCTION);
Contributor Author

@wypoon wypoon Jan 31, 2025

I had to make a choice here. CLASS_TO_FUNCTIONS is used by loadFunctionsByClass(Class) and that is called by the ReplaceStaticInvoke rule. The StaticInvoke contains a class and that is all we have to go by; we do not know if the class belongs to an instance with singular equal to true or false. So I map the classes to function instances of the plural form (to preserve existing behavior).
I didn't feel it worthwhile to have two classes instead of one to distinguish between singular and plural forms.
The tradeoff is that if we inspect the internals of a Spark LogicalPlan, look at an ApplyFunctionExpression corresponding to, e.g., applying YearsFunction.TimestampToYearsFunction, and ask it for its name, we will get "years" (which comes from YearsFunction.TimestampToYearsFunction::name()) regardless of whether we called year or years in our SQL query. (This happens in the verification part of some tests.)
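To illustrate the tradeoff, here is a rough sketch under simplified assumptions. FunctionRegistrySketch, YearsFn, loadByName, and loadByClass are hypothetical names, not the real SparkFunctions API. It only shows why a lookup keyed by class cannot recover which spelling was used.

```java
import java.util.Map;

final class FunctionRegistrySketch {
  // Hypothetical stand-in for YearsFunction with a singular/plural flag.
  static final class YearsFn {
    private final boolean singular;

    YearsFn(boolean singular) {
      this.singular = singular;
    }

    String name() {
      return singular ? "year" : "years";
    }
  }

  private static final YearsFn YEARS_FUNCTION = new YearsFn(false);
  private static final YearsFn YEAR_FUNCTION = new YearsFn(true);

  // Lookup by SQL name: each spelling maps to its own instance.
  private static final Map<String, YearsFn> FUNCTIONS =
      Map.of("years", YEARS_FUNCTION, "year", YEAR_FUNCTION);

  // Lookup by class: both instances share one class, so only one can be registered.
  // The plural instance is chosen here to preserve existing behavior.
  private static final Map<Class<?>, YearsFn> CLASS_TO_FUNCTIONS =
      Map.of(YearsFn.class, YEARS_FUNCTION);

  static YearsFn loadByName(String name) {
    return FUNCTIONS.get(name);
  }

  static YearsFn loadByClass(Class<?> functionClass) {
    return CLASS_TO_FUNCTIONS.get(functionClass);
  }
}
```

In this sketch, loading "year" by name and then reloading by that instance's class yields a function whose name() is "years", which mirrors the behavior described above.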

Contributor Author

Without the changes in SparkV2Filters, changing CLASS_TO_FUNCTIONS to use the singular forms would cause test failures in TestSystemFunctionPushDownInRowLevelOperations and TestSystemFunctionPushDownDQL (even after accounting for the name when verifying the ApplyFunctionExpression).
However, with the changes in SparkV2Filters, we are free to choose the singular forms as standard if we wish.

Comment on lines -73 to +74
ImmutableSet.of("years", "months", "days", "hours", "bucket", "truncate");
ImmutableSet.of(
"year", "years", "month", "months", "day", "days", "hour", "hours", "bucket", "truncate");
Contributor Author

I wasn't aware of the hardcoded references to "years", "months", "days" and "hours" in SparkV2Filters.
The changes here in SparkV2Filters are only necessary if we want to make the singular forms the standard.
Right now, I have left the plural forms as standard.
@rdblue should we make the singular form of the functions the standard?

Contributor

Hey @wypoon, thanks for working on this. Initially, we went for the plural form because the singular form is already a function in Spark (e.g. day). Do you know if these interfere? We should also make sure that we convert this in the tests.

Contributor Author

@Fokko thanks for looking at this, and the explanation for why the plural form was used.
There is no interference between the Iceberg functions and the Spark built-in functions, because the Iceberg functions use the "system" namespace. I added tests for the built-in functions to demonstrate this.
There is one thing that gave me pause, which is the comment in BaseCatalog::isFunctionNamespace:

    // Allow for empty namespace, as Spark's storage partitioned joins look up
    // the corresponding functions to generate transforms for partitioning
    // with an empty namespace, such as `bucket`.
    // Otherwise, use `system` namespace.

For this reason, I added some variants to TestStoragePartitionJoins to use the singular forms of the transforms. Those tests pass too.

Comment on lines +73 to +76
String tz = TimeZone.getDefault().getID();
assertThat(scalarSql(String.format("SELECT hour(TIMESTAMP '2017-12-01 10:12:55 %s')", tz)))
.as("Expected to produce 10")
.isEqualTo(10);
Contributor Author

Note: This is how Spark SQL's TIMESTAMP_LTZ support works. The instant in time is converted to the local timezone and the hour function is applied to that. Thus, if the instant is already specified in the local timezone, the conversion leaves the hour unchanged, so we know what it is.
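The same round trip can be sketched with plain java.time, independent of Spark. LocalHourSketch and hourAfterRoundTrip are hypothetical names used only for illustration. The sketch assumes the wall-clock time does not fall in a daylight-saving gap, where atZone would shift it.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

final class LocalHourSketch {
  // Interprets the wall-clock time in the given zone, reduces it to an instant,
  // then reads the hour back in that same zone.
  static int hourAfterRoundTrip(LocalDateTime wallClock, ZoneId zone) {
    Instant instant = wallClock.atZone(zone).toInstant();
    return ZonedDateTime.ofInstant(instant, zone).getHour();
  }

  // Reads the hour of the same instant in a different zone, for contrast.
  static int hourInOtherZone(LocalDateTime wallClock, ZoneId from, ZoneId to) {
    Instant instant = wallClock.atZone(from).toInstant();
    return ZonedDateTime.ofInstant(instant, to).getHour();
  }
}
```

Passing ZoneId.systemDefault() as the zone corresponds to the test above, where the literal carries the default timezone ID and the expected hour is 10.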

... that were not bullet-proof.

3 participants