Description
TLDR: I would like to add a new binary_as_string option for parquet:
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned'
OPTIONS ('binary_as_string' 'true');
Is your feature request related to a problem or challenge?
The Real Problem
The primary problem is that the ClickBench queries slow down when we enable StringView by default, but only for the hits_partitioned version of the dataset.
One of the reasons is that reading a column as a BinaryViewArray and then casting to Utf8ViewArray is significantly slower than reading the data from parquet as a Utf8ViewArray (due to optimizations in the parquet decoder).
This is not a problem with StringArray and BinaryArray, because reading a column as a BinaryArray and then casting to Utf8Array is about the same speed as reading it as Utf8Array directly.
The core issue is that for hits_partitioned the "String" columns in the schema are marked as binary (not Utf8), and thus the slower conversion path is used.
The background: hits_partitioned has "string" columns marked as "Binary"
ClickBench has 2 versions of the parquet dataset (see docs here):
- hits.parquet (a single 14G parquet file)
- athena_partitioned/hits_{0..99}.parquet
However, the schema is different between these two versions.
hits.parquet has Strings:
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ parquet-schema hits.parquet
Metadata for file: hits.parquet
version: 1
num of rows: 99997497
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
REQUIRED INT64 WatchID;
REQUIRED INT32 JavaEnable (INTEGER(16,true));
REQUIRED BYTE_ARRAY Title (STRING); <--- Annotated with "(String)" logical type
DataFusion recognizes this as Utf8:
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID | Int64 | NO |
| JavaEnable | Int16 | NO |
| Title | Utf8 | NO | <-- StringArray!
...
hits_partitioned has the string columns as Binary:
$ parquet-schema hits_partitioned/hits_22.parquet
Metadata for file: hits_partitioned/hits_22.parquet
version: 1
num of rows: 1000000
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
OPTIONAL INT64 WatchID;
OPTIONAL INT32 JavaEnable (INTEGER(16,true));
OPTIONAL BYTE_ARRAY Title; <---- this is NOT annotated with String logical type
DataFusion correctly interprets this as Binary:
DataFusion CLI v42.0.0
> describe 'hits_partitioned';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID | Int64 | YES |
| JavaEnable | Int16 | YES |
| Title | Binary | YES | <-- BinaryArray
...
Describe the solution you'd like
I would like a way to treat binary columns in the hits_partitioned dataset as Strings.
This is the right thing to do for the hits_partitioned dataset, but I am not sure it is the right thing to do for all files, so I think we need some flag.
Describe alternatives you've considered
I propose adding a binary_as_string option to the parquet reader, like:
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned'
OPTIONS ('binary_as_string' 'true');
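To make the intended semantics concrete, here is a minimal sketch in Python of how such a flag could affect type resolution for a BYTE_ARRAY column. The function name resolve_byte_array_type and the string labels are hypothetical, chosen only for illustration; this is not DataFusion's reader code or API.

```python
from typing import Optional


def resolve_byte_array_type(
    logical_type: Optional[str], binary_as_string: bool = False
) -> str:
    """Pick the in-memory type for a parquet BYTE_ARRAY column.

    Hypothetical helper: `logical_type` is the annotation from the parquet
    schema ("STRING" or None), and the return value is a type label.
    """
    if logical_type == "STRING":
        # Annotated columns are strings regardless of the flag.
        return "Utf8"
    # Un-annotated columns stay Binary unless the user opts in.
    return "Utf8" if binary_as_string else "Binary"


# hits.parquet: Title is annotated with (STRING)
print(resolve_byte_array_type("STRING"))
# hits_partitioned: Title has no annotation, flag off (today's behavior)
print(resolve_byte_array_type(None))
# hits_partitioned with the proposed option enabled
print(resolve_byte_array_type(None, binary_as_string=True))
```

Keeping the default at false means files that really do contain arbitrary binary data are unaffected unless the user opts in.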
In fact, it seems DuckDB does exactly this. Note the (binary_as_string=True) item here:
https://duckdb.org/docs/data/parquet/overview.html#parameters
| Name | Description | Type | Default |
|---|---|---|---|
| binary_as_string | Parquet files generated by legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead. Set this to true to load binary columns as strings. | BOOL | false |
And it is set in clickbench scripts:
Additional context
Trino also explicitly specifies the types of files in the hits_partitioned dataset:
https://github.com/ClickHouse/ClickBench/blob/a6615de8e6eae45f8c1b047df0073fe32f43499f/trino/create_partitioned.sql#L1-L107
We could argue that adding something just for benchmarking goes against the spirit of the test. However, I think we can make a reasonable argument on correctness grounds too.
For example, if you run this query today:
SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;
You get this arguably nonsensical result (the search phrase is shown as binary):
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+--------------------------------------------------------------------------------------------------+-------+
| SearchEngineID | SearchPhrase | c |
+----------------+--------------------------------------------------------------------------------------------------+-------+
| 2 | d0bad0b0d180d0b5d0bbd0bad0b8 | 46258 |
| 2 | d0bcd0b0d0bdd0b3d18320d0b220d0b7d0b0d180d0b0d0b1d0b5d0b920d0b3d180d0b0d0bcd0b0 | 18871 |
| 2 | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd | 16905 |
| 3 | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd | 16748 |
| 2 | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd20d0b1d0b5d181d0bfd0bbd0b0d182d0bdd0be | 14909 |
| 2 | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd | 13716 |
| 2 | d18dd0bad0b7d0bed0b8d0b4d0bdd18bd0b5 | 13414 |
| 2 | d181d0bcd0bed182d180d0b5d182d18c | 13108 |
| 3 | d0bad0b0d180d0b5d0bbd0bad0b8 | 12815 |
| 2 | d0b4d180d183d0b6d0bad0b520d0bfd0bed0bcd0b5d189d0b5d0bdd0b8d0b5 | 11946 |
+----------------+--------------------------------------------------------------------------------------------------+-------+
10 row(s) fetched.
Elapsed 0.561 seconds.
However, if you run the same query with the proper schema (strings), you see a real search phrase:
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+---------------------------+-------+
| SearchEngineID | SearchPhrase | c |
+----------------+---------------------------+-------+
| 2 | карелки | 46258 |
| 2 | мангу в зарабей грама | 18871 |
| 2 | смотреть онлайн | 16905 |
| 3 | албатрутдин | 16748 |
| 2 | смотреть онлайн бесплатно | 14909 |
| 2 | албатрутдин | 13716 |
| 2 | экзоидные | 13414 |
| 2 | смотреть | 13108 |
| 3 | карелки | 12815 |
| 2 | дружке помещение | 11946 |
+----------------+---------------------------+-------+
10 row(s) fetched.
Elapsed 0.569 seconds.
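As a quick sanity check that the binary values really are UTF-8 text, the hex strings from the first result can be decoded by hand. The sketch below uses only the Python standard library; the helper name decode_phrase is illustrative, and the hex values are copied from the first result table.

```python
# Hex renderings of SearchPhrase, copied from the Binary result above.
hex_phrases = [
    "d0bad0b0d180d0b5d0bbd0bad0b8",
    "d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd",
    "d181d0bcd0bed182d180d0b5d182d18c",
]


def decode_phrase(hex_text: str) -> str:
    """Convert a hex rendering of a binary value to text, assuming UTF-8."""
    return bytes.fromhex(hex_text).decode("utf-8")


for hex_text in hex_phrases:
    print(decode_phrase(hex_text))
```

The decoded values match the corresponding rows of the string result (карелки, смотреть онлайн, смотреть), which supports treating these columns as strings.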