Closed
Description
Describe the bug
While working on #12415, I found the LIKE
and ILIKE
behavior differs between StringView
and other string types. Given the following data and SQL:
DataFusion CLI v42.0.0
> create table test_source as values
('Andrew', 'X', 'datafusion📊🔥', '🔥'),
('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
(NULL, 'R', NULL, '🔥');
0 row(s) fetched.
Elapsed 0.028 seconds.
> create table test_basic_operator_utf8view as
select
arrow_cast(column1, 'Utf8View') as ascii_1,
arrow_cast(column2, 'Utf8View') as ascii_2,
arrow_cast(column3, 'Utf8View') as unicode_1,
arrow_cast(column4, 'Utf8View') as unicode_2
from test_source;
0 row(s) fetched.
Elapsed 0.002 seconds.
> create table test_basic_operator_utf8 as
select
arrow_cast(column1, 'Utf8') as ascii_1,
arrow_cast(column2, 'Utf8') as ascii_2,
arrow_cast(column3, 'Utf8') as unicode_1,
arrow_cast(column4, 'Utf8') as unicode_2
from test_source;
0 row(s) fetched.
Elapsed 0.004 seconds.
-- StringView Table ( pattern contain % )
> select ascii_1, unicode_1,
ascii_1 like 'An%' as ascii_like,
unicode_1 like '%ion数据%' as unicode_like,
ascii_1 ilike 'An%' as ascii_ilike,
unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1 | unicode_1 | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew | datafusion📊🔥 | true | false | true | false |
| Xiangpeng | datafusion数据融合 | false | true | false | true |
| Raphael | datafusionДатаФусион | false | false | false | false |
| | | false | false | false | |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched.
Elapsed 0.004 seconds.
-- StringView Table ( pattern without % )
> select ascii_1, unicode_1,
ascii_1 like 'An' as ascii_like,
unicode_1 like '%ion数据' as unicode_like,
ascii_1 ilike 'An' as ascii_ilike,
unicode_1 ilike 'ion数据' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1 | unicode_1 | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew | datafusion📊🔥 | false | false | false | false |
| Xiangpeng | datafusion数据融合 | false | false | false | false |
| Raphael | datafusionДатаФусион | false | false | false | false |
| | | | false | | |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched.
Elapsed 0.004 seconds.
-- String Table (same as LargeString and DictionaryString)
> select ascii_1, unicode_1,
ascii_1 like 'An%' as ascii_like,
unicode_1 like '%ion数据%' as unicode_like,
ascii_1 ilike 'An%' as ascii_ilike,
unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1 | unicode_1 | ascii_like | unicode_like | ascii_ilike | unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew | datafusion📊🔥 | true | false | true | false |
| Xiangpeng | datafusion数据融合 | false | true | false | true |
| Raphael | datafusionДатаФусион | false | false | false | false |
| | | | | | |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched.
Elapsed 0.004 seconds.
When the input value is NULL, string type will return NULL
but string view will return false. (Something is interesting about the ILIKE operation is different between ASCII-only and Unicode StringView
🤔 )
Some testing for StringView ScalarValue
When the matching pattern contains %
, it will return false
instead null
.
> select
arrow_cast(null, 'Utf8View') like '123' as utf8view_like_ascii,
arrow_cast(null, 'Utf8View') ilike '123' as utf8view_ilike_ascii,
arrow_cast(null, 'Utf8View') like '123%' as "utf8view_like_ascii%",
arrow_cast(null, 'Utf8View') ilike '123%' as "utf8view_ilike_ascii%";
+---------------------+----------------------+----------------------+-----------------------+
| utf8view_like_ascii | utf8view_ilike_ascii | utf8view_like_ascii% | utf8view_ilike_ascii% |
+---------------------+----------------------+----------------------+-----------------------+
| | | false | false |
+---------------------+----------------------+----------------------+-----------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select
arrow_cast(null, 'Utf8View') like '數據' as "utf8view_like_unicode",
arrow_cast(null, 'Utf8View') ilike '數據' as "utf8view_ilike_unicode",
arrow_cast(null, 'Utf8View') like '數據%' as "utf8view_like_unicode%",
arrow_cast(null, 'Utf8View') ilike '數據%' as "utf8view_ilike_unicode%";
+-----------------------+------------------------+------------------------+-------------------------+
| utf8view_like_unicode | utf8view_ilike_unicode | utf8view_like_unicode% | utf8view_ilike_unicode% |
+-----------------------+------------------------+------------------------+-------------------------+
| | | false | |
+-----------------------+------------------------+------------------------+-------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
Some testing for String ScalarValue (Same as LargeString and DictionaryString)
> select
arrow_cast(null, 'Utf8') like '123' as utf8_like_ascii,
arrow_cast(null, 'Utf8') ilike '123' as utf8_ilike_ascii,
arrow_cast(null, 'Utf8') like '123%' as "utf8_like_ascii%",
arrow_cast(null, 'Utf8') ilike '123%' as "utf8_ilike_ascii%";
+-----------------+------------------+------------------+-------------------+
| utf8_like_ascii | utf8_ilike_ascii | utf8_like_ascii% | utf8_ilike_ascii% |
+-----------------+------------------+------------------+-------------------+
| | | | |
+-----------------+------------------+------------------+-------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select
arrow_cast(null, 'Utf8') like '數據' as "utf8_like_unicode",
arrow_cast(null, 'Utf8') ilike '數據' as "utf8_ilike_unicode",
arrow_cast(null, 'Utf8') like '數據%' as "utf8_like_unicode%",
arrow_cast(null, 'Utf8') ilike '數據%' as "utf8_ilike_unicode%";
+-------------------+--------------------+--------------------+---------------------+
| utf8_like_unicode | utf8_ilike_unicode | utf8_like_unicode% | utf8_ilike_unicode% |
+-------------------+--------------------+--------------------+---------------------+
| | | | |
+-------------------+--------------------+--------------------+---------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
To Reproduce
Run the SQLs mentioned above.
Expected behavior
I'm not really sure if the behavior of StringView is expected 🤔 but I think their behavior should be consistent.
If the input is null, the like and ilike should return null.
Additional context
No response