Skip to content

Incorrect LIKE and ILIKE result for NULL input, `%', and '%%' pattern #12637

Closed
@goldmedal

Description

@goldmedal

Describe the bug

While working on #12415, I found the LIKE and ILIKE behavior differs between StringView and other string types. Given the following data and SQL:

DataFusion CLI v42.0.0
> create table test_source as values
  ('Andrew', 'X', 'datafusion📊🔥', '🔥'),
  ('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
  ('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
  (NULL, 'R', NULL, '🔥');
0 row(s) fetched. 
Elapsed 0.028 seconds.

> create table test_basic_operator_utf8view as
select
    arrow_cast(column1, 'Utf8View') as ascii_1,
    arrow_cast(column2, 'Utf8View') as ascii_2,
    arrow_cast(column3, 'Utf8View') as unicode_1,
    arrow_cast(column4, 'Utf8View') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.002 seconds.

> create table test_basic_operator_utf8 as
select
    arrow_cast(column1, 'Utf8') as ascii_1,
    arrow_cast(column2, 'Utf8') as ascii_2,
    arrow_cast(column3, 'Utf8') as unicode_1,
    arrow_cast(column4, 'Utf8') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.004 seconds.

-- StringView Table ( pattern contain % )
> select ascii_1, unicode_1,
       ascii_1 like 'An%' as ascii_like,
       unicode_1 like '%ion数据%' as unicode_like,
       ascii_1 ilike 'An%' as ascii_ilike,
       unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1   | unicode_1            | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew    | datafusion📊🔥       | true       | false        | true        | false        |
| Xiangpeng | datafusion数据融合   | false      | true         | false       | true         |
| Raphael   | datafusionДатаФусион | false      | false        | false       | false        |
|           |                      | false      | false        | false       |              |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched. 
Elapsed 0.004 seconds.

-- StringView Table ( pattern without % )
> select ascii_1, unicode_1,
       ascii_1 like 'An' as ascii_like,
       unicode_1 like '%ion数据' as unicode_like,
       ascii_1 ilike 'An' as ascii_ilike,
       unicode_1 ilike 'ion数据' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1   | unicode_1            | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew    | datafusion📊🔥       | false      | false        | false       | false        |
| Xiangpeng | datafusion数据融合   | false      | false        | false       | false        |
| Raphael   | datafusionДатаФусион | false      | false        | false       | false        |
|           |                      |            | false        |             |              |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched. 
Elapsed 0.004 seconds.

-- String Table (same as LargeString and DictionaryString)
> select ascii_1, unicode_1,
       ascii_1 like 'An%' as ascii_like,
       unicode_1 like '%ion数据%' as unicode_like,
       ascii_1 ilike 'An%' as ascii_ilike,
       unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1   | unicode_1            | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew    | datafusion📊🔥       | true       | false        | true        | false        |
| Xiangpeng | datafusion数据融合   | false      | true         | false       | true         |
| Raphael   | datafusionДатаФусион | false      | false        | false       | false        |
|           |                      |            |              |             |              |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched. 
Elapsed 0.004 seconds.

When the input value is NULL, string type will return NULL but string view will return false. (Something is interesting about the ILIKE operation is different between ASCII-only and Unicode StringView 🤔 )

Some testing for StringView ScalarValue

When the matching pattern contains %, it will return false instead null.

> select
    arrow_cast(null, 'Utf8View') like '123' as utf8view_like_ascii,
    arrow_cast(null, 'Utf8View') ilike '123' as utf8view_ilike_ascii,
    arrow_cast(null, 'Utf8View') like '123%' as "utf8view_like_ascii%",
    arrow_cast(null, 'Utf8View') ilike '123%' as "utf8view_ilike_ascii%";
+---------------------+----------------------+----------------------+-----------------------+
| utf8view_like_ascii | utf8view_ilike_ascii | utf8view_like_ascii% | utf8view_ilike_ascii% |
+---------------------+----------------------+----------------------+-----------------------+
|                     |                      | false                | false                 |
+---------------------+----------------------+----------------------+-----------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select    
    arrow_cast(null, 'Utf8View') like '數據' as "utf8view_like_unicode",
    arrow_cast(null, 'Utf8View') ilike '數據' as "utf8view_ilike_unicode",
    arrow_cast(null, 'Utf8View') like '數據%' as "utf8view_like_unicode%",
    arrow_cast(null, 'Utf8View') ilike '數據%' as "utf8view_ilike_unicode%";
+-----------------------+------------------------+------------------------+-------------------------+
| utf8view_like_unicode | utf8view_ilike_unicode | utf8view_like_unicode% | utf8view_ilike_unicode% |
+-----------------------+------------------------+------------------------+-------------------------+
|                       |                        | false                  |                         |
+-----------------------+------------------------+------------------------+-------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

Some testing for String ScalarValue (Same as LargeString and DictionaryString)

> select
    arrow_cast(null, 'Utf8') like '123' as utf8_like_ascii,
    arrow_cast(null, 'Utf8') ilike '123' as utf8_ilike_ascii,
    arrow_cast(null, 'Utf8') like '123%' as "utf8_like_ascii%",
    arrow_cast(null, 'Utf8') ilike '123%' as "utf8_ilike_ascii%";
+-----------------+------------------+------------------+-------------------+
| utf8_like_ascii | utf8_ilike_ascii | utf8_like_ascii% | utf8_ilike_ascii% |
+-----------------+------------------+------------------+-------------------+
|                 |                  |                  |                   |
+-----------------+------------------+------------------+-------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> select    
    arrow_cast(null, 'Utf8') like '數據' as "utf8_like_unicode",
    arrow_cast(null, 'Utf8') ilike '數據' as "utf8_ilike_unicode",
    arrow_cast(null, 'Utf8') like '數據%' as "utf8_like_unicode%",
    arrow_cast(null, 'Utf8') ilike '數據%' as "utf8_ilike_unicode%";
+-------------------+--------------------+--------------------+---------------------+
| utf8_like_unicode | utf8_ilike_unicode | utf8_like_unicode% | utf8_ilike_unicode% |
+-------------------+--------------------+--------------------+---------------------+
|                   |                    |                    |                     |
+-------------------+--------------------+--------------------+---------------------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

To Reproduce

Run the SQLs mentioned above.

Expected behavior

I'm not really sure if the behavior of StringView is expected 🤔 but I think their behavior should be consistent.
If the input is null, the like and ilike should return null.

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingwaiting-on-upstreamPR is waiting on an upstream dependency to be updated

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions