Conversation

rafafrdz
Contributor

@rafafrdz rafafrdz commented Sep 9, 2025

Which issue does this PR close?

part of #15914

Rationale for this change

Migrate Spark functions from https://github.com/lakehq/sail/ to the DataFusion engine to unify the codebase.

What changes are included in this PR?

Fix

  • PATH part: if the URL has no path part, parse_url returns an empty string (not "/")
  • AUTHORITY part: the authority must include the port whenever the URL explicitly specifies one. Previously, well-known ports (e.g., 80, 23) were omitted from the authority, and only non-standard (“custom”) ports were shown.
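To make the intended behavior concrete, here is a minimal sketch of the two rules in plain Rust (hypothetical helpers, not the PR's actual code, which works on a fully parsed URL):

```rust
/// PATH: a URL whose path is empty (or a bare "/") yields "", not "/".
/// Hypothetical helper; the real implementation uses a proper URL parser.
fn path_part(path: &str) -> &str {
    if path == "/" { "" } else { path }
}

/// AUTHORITY: userinfo@host[:port], where the port appears only when the
/// original URL explicitly spells one out.
fn authority_part(userinfo: Option<&str>, host: &str, port: Option<u16>) -> String {
    let mut out = String::new();
    if let Some(u) = userinfo {
        out.push_str(u);
        out.push('@');
    }
    out.push_str(host);
    if let Some(p) = port {
        out.push_str(&format!(":{p}"));
    }
    out
}

fn main() {
    assert_eq!(path_part("/"), "");
    assert_eq!(
        authority_part(Some("userinfo"), "spark.apache.org", None),
        "userinfo@spark.apache.org"
    );
    assert_eq!(authority_part(None, "example.com", Some(8080)), "example.com:8080");
}
```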

Are these changes tested?

Unit tests and sqllogictests added.

Are there any user-facing changes?

Yes, parse_url and try_parse_url can now be called in queries.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark documentation Improvements or additions to documentation and removed documentation Improvements or additions to documentation labels Sep 9, 2025
@rafafrdz rafafrdz marked this pull request as draft September 9, 2025 12:39
@rafafrdz rafafrdz marked this pull request as ready for review September 9, 2025 14:40
Contributor

@Jefffrey Jefffrey left a comment


Looks like some CI failures to address

Comment on lines 50 to 49
-signature: Signature::one_of(
-    vec![
-        TypeSignature::Uniform(
-            1,
-            vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-        ),
-        TypeSignature::Uniform(
-            2,
-            vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-        ),
-        TypeSignature::Uniform(
-            3,
-            vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-        ),
-    ],
-    Volatility::Immutable,
-),
+signature: Signature::user_defined(Volatility::Immutable),
Contributor


Does this change interfere with dictionary types now? I think this works on main but doesn't work on this PR:

query T
SELECT parse_url(arrow_cast('http://spark.apache.org/path?query=1', 'Dictionary(Int32, Utf8)'), 'HOST'::string);
----
spark.apache.org

Contributor Author


Sorry, I don’t quite follow… Why should dictionary types be used as arguments for parse_url or try_parse_url? As far as I understand, the spec only expects string types (parse_url doc ref).

Contributor


Dictionary type is another way to represent strings, similar to how view type is a different way to represent strings.

You can see here in the doc that there's a reference to handling dictionary types:

/// One or more arguments of all the same string types.
///
/// The precedence of type from high to low is Utf8View, LargeUtf8 and Utf8.
/// Null is considered as `Utf8` by default
/// Dictionary with string value type is also handled.
///
/// For example, if a function is called with (utf8, large_utf8), all
/// arguments will be coerced to `LargeUtf8`
///
/// For functions that take no arguments (e.g. `random()`), use [`TypeSignature::Nullary`].
String(usize),
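For intuition, a dictionary-encoded string column is just a set of distinct values plus integer keys into them, so logically it is still a string column. A rough sketch of the idea (assumed semantics, not Arrow's actual API):

```rust
/// Toy dictionary array: distinct string values stored once, plus keys
/// indexing into them. Not Arrow's real representation, just the concept.
struct DictArray {
    keys: Vec<usize>,
    values: Vec<String>,
}

impl DictArray {
    /// "Unpack" the dictionary back into plain strings, which is what a
    /// string-typed signature would effectively coerce it to.
    fn to_strings(&self) -> Vec<&str> {
        self.keys.iter().map(|&k| self.values[k].as_str()).collect()
    }
}

fn main() {
    let col = DictArray {
        values: vec!["http://spark.apache.org/path?query=1".to_string()],
        keys: vec![0, 0, 0], // three rows, one distinct value
    };
    assert_eq!(col.to_strings().len(), 3);
    assert!(col
        .to_strings()
        .iter()
        .all(|s| *s == "http://spark.apache.org/path?query=1"));
}
```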

Contributor Author


I see, let me check it out then 😃

Contributor Author

@rafafrdz rafafrdz Sep 11, 2025


After rereading this several times, my understanding is that when you pass a Dictionary with string values, DataFusion attempts to match it against the String signature. However, parse_url is defined to accept only plain string arguments ref. It does not expect any dictionary inputs.

We mark the UDF’s signature as user_defined to enable coercion across string types (Utf8, Utf8View, LargeUtf8), but a dictionary array is still not a string type, so it isn’t coerced, and the call won’t match.

In short, even if the String signature seems to "capture" Dictionary with string values, parse_url will still reject them because the underlying physical type is a dictionary, not a string.

Contributor


That's correct, but my point was that previously (on main), parse_url seemed to be capable of accepting string dictionary types, and with these changes that is no longer possible; is this something to be concerned about?

Contributor


Is it possible to simplify the type signature to just be:

signature: Signature::one_of(
    vec![
        TypeSignature::String(1),
        TypeSignature::String(2),
        TypeSignature::String(3),
    ],
    Volatility::Immutable,
),

So they get cast to the same string type? This would avoid the need for the large match statement in spark_handled_parse_url as well.

See related comment on other PR: #17195 (comment)
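The precedence rule quoted from the String(usize) doc comment (Utf8View over LargeUtf8 over Utf8) can be sketched with a toy enum; this is not DataFusion's API, just the coercion idea:

```rust
/// Toy stand-in for the three Arrow string types (not DataFusion's types).
#[derive(Debug, PartialEq)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick the common type by precedence: Utf8View > LargeUtf8 > Utf8.
/// Once every argument is coerced to this one type, the kernel needs a
/// single code path instead of a match arm per type combination.
fn coerce(args: &[StringType]) -> StringType {
    if args.contains(&StringType::Utf8View) {
        StringType::Utf8View
    } else if args.contains(&StringType::LargeUtf8) {
        StringType::LargeUtf8
    } else {
        StringType::Utf8
    }
}

fn main() {
    // Mirrors the doc-comment example: (utf8, large_utf8) -> LargeUtf8.
    assert_eq!(
        coerce(&[StringType::Utf8, StringType::LargeUtf8]),
        StringType::LargeUtf8
    );
    assert_eq!(
        coerce(&[StringType::Utf8, StringType::Utf8View]),
        StringType::Utf8View
    );
}
```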

@rafafrdz rafafrdz marked this pull request as draft September 11, 2025 11:04
@rafafrdz rafafrdz marked this pull request as ready for review September 11, 2025 13:59
Contributor

@Jefffrey Jefffrey left a comment


Looks like some fixes for PATH and AUTHORITY are now included in this PR, so would be good to call this out in the PR body.

Arc::new(LargeStringArray::from(vals.to_vec())) as ArrayRef
}

#[test]
Contributor


I feel quite a few of these tests could be done as slt tests; @alamb do we have a preference on where tests should be done? Should we prefer slt over rust tests, and fall back to rust only if it is something that slt can't handle?

Took a look at https://datafusion.apache.org/contributor-guide/testing.html but it doesn't mention if we have a specific preference, other than slt's being easier to maintain.

Contributor


I personally prefer slt tests. But I agree we don't have clear guidance.

@rafafrdz rafafrdz requested a review from Jefffrey September 12, 2025 17:04
@Jefffrey
Contributor

I recommend running cargo clippy & test locally first so we don't need to find out via the CI checks

@rafafrdz
Contributor Author

I recommend running cargo clippy & test locally first so we don't need to find out via the CI checks

done

Contributor

@Jefffrey Jefffrey left a comment


One minor suggestion, other than that this should be good to go once the CI failures are resolved 👍

@Jefffrey
Contributor

Looks like some test failures:

1. query result mismatch:
[SQL] SELECT parse_url('http://userinfo@spark.apache.org/path?query=1#Ref'::string, 'AUTHORITY'::string);
[Diff] (-expected|+actual)
-   userinfo@spark.apache.org
+   userinfo@spark.apache.org:80
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/spark/url/parse_url.slt:63


2. query failed: DataFusion error: Execution error: RelativeUrlWithoutBase
[SQL] SELECT parse_url('www.example.com/path?x=1', 'HOST');
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/spark/url/parse_url.slt:83


3. query failed: DataFusion error: Execution error: RelativeUrlWithoutBase
[SQL] SELECT parse_url('www.example.com/path?x=1', 'host');
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/spark/url/parse_url.slt:88


4. query failed: DataFusion error: Execution error: RelativeUrlWithoutBase
[SQL] SELECT parse_url('notaurl', 'HOST');
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/spark/url/parse_url.slt:133


5. query failed: DataFusion error: Execution error: RelativeUrlWithoutBase
[SQL] SELECT parse_url('notaurl', 'host');
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/spark/url/parse_url.slt:138

For reference I checked expected outputs against Spark 4.0.0:

spark-sql (default)> SELECT parse_url('http://userinfo@spark.apache.org/path?query=1#Ref'::string, 'AUTHORITY'::string);
userinfo@spark.apache.org
Time taken: 0.779 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT parse_url('www.example.com/path?x=1', 'HOST');
NULL
Time taken: 0.054 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT parse_url('www.example.com/path?x=1', 'host');
NULL
Time taken: 0.041 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT parse_url('notaurl', 'HOST');
NULL
Time taken: 0.022 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT parse_url('notaurl', 'host');
NULL
Time taken: 0.021 seconds, Fetched 1 row(s)
spark-sql (default)>

Also I noticed we don't have a test that shows the difference between parse_url and try_parse_url; we could use this one from the Spark docs: https://spark.apache.org/docs/latest/api/sql/index.html#try_parse_url

SELECT try_parse_url('inva lid://spark.apache.org/path?query=1', 'QUERY');

Where for parse_url it errors as expected but try_parse_url returns null
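The relationship between the two functions can be sketched with a toy query extractor (hypothetical, std-only; the real functions use a full URL parser): both share one fallible path, and the try_ variant maps the error to NULL instead of failing the query.

```rust
/// Toy extractor for the QUERY part. Treats any URL containing a space as
/// invalid, standing in for the real parser's error case.
fn extract_query(url: &str) -> Result<Option<String>, String> {
    if url.contains(' ') {
        return Err(format!("invalid URL: {url}"));
    }
    Ok(url
        .split_once('?')
        .map(|(_, q)| q.split('#').next().unwrap_or("").to_string()))
}

/// try_parse_url semantics: an error becomes NULL (None) rather than
/// propagating as an execution error.
fn try_extract_query(url: &str) -> Option<String> {
    extract_query(url).ok().flatten()
}

fn main() {
    assert_eq!(
        extract_query("http://spark.apache.org/path?query=1"),
        Ok(Some("query=1".to_string()))
    );
    // parse_url would surface this error; try_parse_url returns NULL.
    assert!(extract_query("inva lid://spark.apache.org/path?query=1").is_err());
    assert_eq!(try_extract_query("inva lid://spark.apache.org/path?query=1"), None);
}
```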

@alamb
Contributor

alamb commented Oct 4, 2025

@rafafrdz -- do you think you'll be able to help finish this PR any time soon?

@rafafrdz
Contributor Author

rafafrdz commented Oct 4, 2025

@alamb @Jefffrey sorry, I've been too busy these last weeks. I'll try to finish it by this weekend.

@alamb alamb changed the title feat(spark): implement Spark try_parse_url function feat(spark): implement Spark parse_url and try_parse_url function Oct 6, 2025
@alamb alamb changed the title feat(spark): implement Spark parse_url and try_parse_url function feat(spark): implement Spark try_parse_url function Oct 6, 2025
Contributor

@alamb alamb left a comment


Looks good to me -- thank you @rafafrdz and @Jefffrey

@alamb
Contributor

alamb commented Oct 6, 2025

(this is a really nice addition)

@alamb alamb added this pull request to the merge queue Oct 7, 2025
Merged via the queue into apache:main with commit 58ddf0d Oct 7, 2025
1 check passed