feat: add has_overlap function to functions_list.yaml #987

Open
benbellick wants to merge 5 commits into main from benbellick/arrays-overlap
@benbellick (Member) commented Mar 4, 2026

Closes #986

Adds a has_overlap scalar function to functions_list.yaml that determines whether two lists share any common elements.

This was initially placed in functions_set.yaml, but per discussion it makes more sense in functions_list.yaml: the function operates on list types, and bundling functions under a functions_<datatype> namespace is the more natural convention.

Cross-Engine Semantic Comparison

The following table was generated by running equivalent queries across engines via Docker (see details below for the script used to generate it).

| Test Case | PostgreSQL | DuckDB | Spark | Trino | DataFusion | ClickHouse |
|---|---|---|---|---|---|---|
| [1,2,3] && [3,4,5] | true | true | true | true | true | true |
| [1,2,3] && [4,5,6] | false | false | false | false | false | false |
| [1,2,3] && [] | false | false | false | false | false | false |
| [] && [] | false | false | false | false | false | false |
| [1,NULL,3] && [3,4] | true | true | true | true | true | true |
| ⚠️ [1,NULL,3] && [4,5] | false | false | NULL | NULL | false | false |
| ⚠️ [1,NULL] && [NULL,4] | false | false | NULL | NULL | true | true |
| NULL && [1,2] | NULL | NULL | NULL | NULL | NULL | NULL |
| [1,1,2] && [1,3] | true | true | true | true | true | true |

The divergence is entirely about how NULL elements are handled when there is no definitive non-null overlap:

| Behavior | Engines |
|---|---|
| THREE_VALUED | Spark, Trino |
| IGNORE_NULLS | PostgreSQL, DuckDB |
| NULL_EQUALS_NULL | DataFusion, ClickHouse |

This is modeled via a null_handling option with those three values.
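
As a rough illustration of the three modes, here is a minimal Python sketch that reproduces the rows of the comparison table. It is not the Substrait definition or any engine's implementation: the Python function signature is hypothetical, `None` stands in for NULL, and the THREE_VALUED rule (NULL when either list contains a null and there is no non-null match) is inferred from the two ⚠️ rows. How THREE_VALUED treats an empty list paired with a list containing nulls is not covered by the table; returning false there is an assumption based on ordinary three-valued logic.

```python
from typing import Hashable, Optional, Sequence

MODES = ("THREE_VALUED", "IGNORE_NULLS", "NULL_EQUALS_NULL")


def has_overlap(
    left: Optional[Sequence[Optional[Hashable]]],
    right: Optional[Sequence[Optional[Hashable]]],
    null_handling: str,
) -> Optional[bool]:
    """Sketch of has_overlap; None models SQL NULL throughout."""
    if null_handling not in MODES:
        raise ValueError(f"unknown null_handling: {null_handling}")

    # A NULL list argument yields NULL in every engine in the table.
    if left is None or right is None:
        return None

    # A shared non-null element means true in every mode.
    left_values = {x for x in left if x is not None}
    right_values = {x for x in right if x is not None}
    if left_values & right_values:
        return True

    left_has_null = any(x is None for x in left)
    right_has_null = any(x is None for x in right)

    if null_handling == "NULL_EQUALS_NULL":
        # NULL matches NULL, so both lists must contain one.
        return left_has_null and right_has_null

    if null_handling == "THREE_VALUED":
        # Assumption: with an empty list there is nothing to compare, so false.
        if len(left) > 0 and len(right) > 0 and (left_has_null or right_has_null):
            return None
        return False

    # IGNORE_NULLS: null elements never contribute to an overlap.
    return False
```

For example, `has_overlap([1, None], [None, 4], mode)` gives `False` for IGNORE_NULLS, `None` for THREE_VALUED, and `True` for NULL_EQUALS_NULL, matching the second ⚠️ row.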

Note: test cases for null list arguments (e.g. has_overlap(null::list<i32>, ...)) will be added once #968 lands.

Cross-engine verification script
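#!/bin/bash
# Requires Docker, plus datafusion-cli installed locally (it is not run via Docker here).
# Each run_* helper prints NULL when the engine produces no output.

# Start Trino in the background; the fixed sleep is a crude wait for startup
docker run --rm -d --name trino-test trinodb/trino:latest > /dev/null 2>&1
echo "Waiting for Trino to start..."
sleep 25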

run_pg() {
    result=$(docker run --rm -e POSTGRES_PASSWORD=test postgres:latest su postgres -c "initdb -D /tmp/pgdata >/dev/null 2>&1 && pg_ctl start -D /tmp/pgdata -l /tmp/pg.log -o '-k /tmp' >/dev/null 2>&1 && sleep 2 && psql -h /tmp -t -A -c \"$1\"" 2>/dev/null)
    if [ -z "$result" ]; then echo "NULL"; else echo "$result"; fi
}

run_duckdb() {
    result=$(docker run --rm duckdb/duckdb:latest duckdb -noheader -list -c "$1" 2>/dev/null)
    if [ -z "$result" ]; then echo "NULL"; else echo "$result"; fi
}

run_spark() {
    result=$(docker run --rm apache/spark:latest /opt/spark/bin/spark-sql -e "$1" 2>&1 | grep -v "WARN\|INFO\|Time taken\|Spark Web\|Spark master\|ObjectStore\|^Using\|^Setting\|^24\|^Ivy\|^Preparing\|^Resolving\|^Downloading\|^::" | grep -v "^$" | tail -1)
    if [ -z "$result" ]; then echo "NULL"; else echo "$result"; fi
}

run_trino() {
    result=$(docker exec trino-test trino --execute "$1" 2>&1 | grep -v "WARNING\|jline" | tr -d '"')
    if [ -z "$result" ]; then echo "NULL"; else echo "$result"; fi
}

run_datafusion() {
    result=$(datafusion-cli -c "$1" 2>&1 | grep -E "^\|" | tail -1 | sed 's/|//g' | tr -d ' ')
    if [ -z "$result" ]; then echo "NULL"; else echo "$result"; fi
}

run_clickhouse() {
    result=$(docker run --rm clickhouse/clickhouse-server:latest clickhouse-local --query "$1" 2>/dev/null)
    if [ -z "$result" ]; then echo "NULL"; else echo "$result"; fi
}

run_test() {
    local label="$1" pg="$2" duck="$3" spark="$4" trino="$5" df="$6" ch="$7"
    echo ""
    echo "$label"
    echo "  PostgreSQL: $(run_pg "$pg")"
    echo "  DuckDB:     $(run_duckdb "$duck")"
    echo "  Spark:      $(run_spark "$spark")"
    echo "  Trino:      $(run_trino "$trino")"
    echo "  DataFusion: $(run_datafusion "$df")"
    echo "  ClickHouse: $(run_clickhouse "$ch")"
}

run_test "[1,2,3] && [3,4,5]" \
    "SELECT ARRAY[1,2,3] && ARRAY[3,4,5]" \
    "SELECT list_has_any([1,2,3], [3,4,5])" \
    "SELECT arrays_overlap(array(1,2,3), array(3,4,5))" \
    "SELECT arrays_overlap(ARRAY[1,2,3], ARRAY[3,4,5])" \
    "SELECT array_has_any(make_array(1,2,3), make_array(3,4,5))" \
    "SELECT hasAny([1,2,3], [3,4,5])"

run_test "[1,2,3] && [4,5,6]" \
    "SELECT ARRAY[1,2,3] && ARRAY[4,5,6]" \
    "SELECT list_has_any([1,2,3], [4,5,6])" \
    "SELECT arrays_overlap(array(1,2,3), array(4,5,6))" \
    "SELECT arrays_overlap(ARRAY[1,2,3], ARRAY[4,5,6])" \
    "SELECT array_has_any(make_array(1,2,3), make_array(4,5,6))" \
    "SELECT hasAny([1,2,3], [4,5,6])"

run_test "[1,2,3] && []" \
    "SELECT ARRAY[1,2,3] && ARRAY[]::int[]" \
    "SELECT list_has_any([1,2,3], []::INT[])" \
    "SELECT arrays_overlap(array(1,2,3), array())" \
    "SELECT arrays_overlap(ARRAY[1,2,3], ARRAY[])" \
    "SELECT array_has_any(make_array(1,2,3), make_array())" \
    "SELECT hasAny([1,2,3], [])"

run_test "[] && []" \
    "SELECT ARRAY[]::int[] && ARRAY[]::int[]" \
    "SELECT list_has_any([]::INT[], []::INT[])" \
    "SELECT arrays_overlap(array(), array())" \
    "SELECT arrays_overlap(ARRAY[], ARRAY[])" \
    "SELECT array_has_any(make_array(), make_array())" \
    "SELECT hasAny([], [])"

run_test "[1,NULL,3] && [3,4]" \
    "SELECT ARRAY[1,NULL,3] && ARRAY[3,4]" \
    "SELECT list_has_any([1,NULL,3], [3,4])" \
    "SELECT arrays_overlap(array(1,NULL,3), array(3,4))" \
    "SELECT arrays_overlap(ARRAY[1,NULL,3], ARRAY[3,4])" \
    "SELECT array_has_any(make_array(1,NULL,3), make_array(3,4))" \
    "SELECT hasAny([1,NULL,3], [3,4])"

run_test "[1,NULL,3] && [4,5]" \
    "SELECT ARRAY[1,NULL,3] && ARRAY[4,5]" \
    "SELECT list_has_any([1,NULL,3], [4,5])" \
    "SELECT arrays_overlap(array(1,NULL,3), array(4,5))" \
    "SELECT arrays_overlap(ARRAY[1,NULL,3], ARRAY[4,5])" \
    "SELECT array_has_any(make_array(1,NULL,3), make_array(4,5))" \
    "SELECT hasAny([1,NULL,3], [4,5])"

run_test "[1,NULL] && [NULL,4]" \
    "SELECT ARRAY[1,NULL] && ARRAY[NULL,4]" \
    "SELECT list_has_any([1,NULL], [NULL,4])" \
    "SELECT arrays_overlap(array(1,NULL), array(NULL,4))" \
    "SELECT arrays_overlap(ARRAY[1,NULL], ARRAY[NULL,4])" \
    "SELECT array_has_any(make_array(1,NULL), make_array(NULL,4))" \
    "SELECT hasAny([1,NULL], [NULL,4])"

run_test "NULL && [1,2]" \
    "SELECT NULL::int[] && ARRAY[1,2]" \
    "SELECT list_has_any(NULL::INT[], [1,2])" \
    "SELECT arrays_overlap(NULL, array(1,2))" \
    "SELECT arrays_overlap(NULL, ARRAY[1,2])" \
    "SELECT array_has_any(NULL::INT[], make_array(1,2))" \
    "SELECT hasAny([1,2]::Nullable(Array(Int32)), NULL)"

run_test "[1,1,2] && [1,3]" \
    "SELECT ARRAY[1,1,2] && ARRAY[1,3]" \
    "SELECT list_has_any([1,1,2], [1,3])" \
    "SELECT arrays_overlap(array(1,1,2), array(1,3))" \
    "SELECT arrays_overlap(ARRAY[1,1,2], ARRAY[1,3])" \
    "SELECT array_has_any(make_array(1,1,2), make_array(1,3))" \
    "SELECT hasAny([1,1,2], [1,3])"

docker stop trino-test > /dev/null 2>&1

Note: These changes were made with Claude's assistance. All code has been reviewed by me.



@benbellick marked this pull request as ready for review March 4, 2026 18:32
elements overlap.

- THREE_VALUED: Returns NULL when null elements are present in
both lists but no non-null overlap is found.
Contributor

Same comment as @vbarua had in the other PR: if the option is not specified, I guess it is assumed to be engine-dependent?

I keep forgetting the dialect... does the dialect allow specifying supported options or enum arguments?

Member Author

I'm still not sure what the right thing to do here is 😟

urn: extension:io.substrait:functions_set
scalar_functions:
  -
    name: "index_in"
Contributor

it's odd that this function is actually in the set 😄

Member Author

Yes 🤦 If it returned a boolean I could see it. But returning the index? That isn't very set-like 😅

Contributor

Do we also want to move this to list? Sorry, this comment was not published. It is technically a breaking change, but I'm wondering how many would have taken dependencies on it...

Contributor

We can now deprecate this and move it to the list functions! 😆 Please consider it in a different PR.

@benbellick changed the title from "feat: add has_overlap function to functions_set.yaml" to "feat: add has_overlap function to functions_list.yaml" Mar 5, 2026
@benbellick requested a review from yongchul March 5, 2026 22:34
@benbellick (Member, Author)

@yongchul I agree it may be a good idea to just get rid of functions_set. In the interest of keeping this PR from becoming a breaking change, let's leave it for future work if that is alright with you :)

@yongchul (Contributor)

> @yongchul I agree it may be a good idea to just get rid of functions_set. In the interest of keeping this PR from becoming a breaking change, let's leave it for future work if that is alright with you :)

I'm okay with keeping the file, but it would be nice if we could do some prep work (i.e., duplicating the function and adding a deprecation warning to its description in the file). It's fine to do that in a separate PR.

    - name: right
      value: list<any1>
  options:
    null_handling:
Contributor

My 2 cents: use an enum for clarity and interoperability. One might argue that this is an option that should follow the system default, but:

a. the alternatives are rather well understood
b. it may not be consistent (i.e., a system may not implement null handling consistently)
c. it works better with the dialect, I believe...
d. null handling is kind of essential to the semantics of the overlap (null is unfortunately not an exception in real data).
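
The trade-off discussed in this thread can be sketched in Python. This is only an illustration under stated assumptions, not how Substrait or any dialect resolves the question: `NullHandling`, `ENGINE_BEHAVIOR`, `resolve_option`, and `resolve_enum_argument` are hypothetical names, the per-engine behavior is copied from the comparison in the PR description, the "unspecified option falls back to the engine's own behavior" rule is the reading suggested in review rather than a confirmed one, and rejecting an unsupported mode is an assumed dialect policy.

```python
from enum import Enum
from typing import Optional


class NullHandling(Enum):
    THREE_VALUED = "THREE_VALUED"
    IGNORE_NULLS = "IGNORE_NULLS"
    NULL_EQUALS_NULL = "NULL_EQUALS_NULL"


# Native behavior per engine, copied from the comparison in the PR description.
ENGINE_BEHAVIOR = {
    "Spark": NullHandling.THREE_VALUED,
    "Trino": NullHandling.THREE_VALUED,
    "PostgreSQL": NullHandling.IGNORE_NULLS,
    "DuckDB": NullHandling.IGNORE_NULLS,
    "DataFusion": NullHandling.NULL_EQUALS_NULL,
    "ClickHouse": NullHandling.NULL_EQUALS_NULL,
}


def resolve_option(engine: str, requested: Optional[NullHandling]) -> NullHandling:
    """Option style (assumed): an unspecified value falls back to the engine's behavior."""
    return ENGINE_BEHAVIOR[engine] if requested is None else requested


def resolve_enum_argument(engine: str, requested: NullHandling) -> NullHandling:
    """Enum style (assumed): the plan always names a mode; an engine without it rejects the plan."""
    if ENGINE_BEHAVIOR[engine] is not requested:
        raise NotImplementedError(f"{engine} does not natively provide {requested.name}")
    return requested
```

Under these assumptions, a plan that omits the option resolves to THREE_VALUED on Spark but IGNORE_NULLS on PostgreSQL, so the same plan can return different results. With a required enum argument, the mismatch surfaces as an explicit error instead.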
