Performance: Add "read strings as binary" option for parquet #12788

Closed
@alamb

Description

TLDR: I would like to add a new binary_as_string option for parquet:

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned'
OPTIONS ('binary_as_string' 'true');

Is your feature request related to a problem or challenge?

The Real Problem

The primary problem is that the ClickBench queries slow down when we enable StringView by default, but only for the hits_partitioned version of the dataset.

One of the reasons is that reading a column as a BinaryViewArray and then casting to Utf8ViewArray is significantly slower than reading the data from parquet as a Utf8ViewArray (due to optimizations in the parquet decoder).

This is not a problem for BinaryArray --> StringArray, because reading a column as a BinaryArray and then casting it to a StringArray is about the same speed as reading it directly as a StringArray.

The core issue is that for hits_partitioned the "String" columns in the schema are marked as binary (not Utf8) and thus the slower conversion path is used.

The background: hits_partitioned has "string" columns marked as "Binary"

Clickbench has 2 versions of the parquet dataset (see docs here)

  1. hits.parquet (a single 14G parquet file)
  2. athena_partitioned/hits_{0..99}.parquet

However, the SCHEMA is different between these two versions:

hits.parquet has Strings:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ parquet-schema hits.parquet
Metadata for file: hits.parquet

version: 1
num of rows: 99997497
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
  REQUIRED INT64 WatchID;
  REQUIRED INT32 JavaEnable (INTEGER(16,true));
  REQUIRED BYTE_ARRAY Title (STRING);    <---  Annotated with "(String)" logical type

DataFusion recognizes this as Utf8

> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | NO          |
| JavaEnable            | Int16     | NO          |
| Title                 | Utf8      | NO          | <-- StringArray!
...

hits_partitioned has the string columns as Binary:

$ parquet-schema hits_partitioned/hits_22.parquet
Metadata for file: hits_partitioned/hits_22.parquet

version: 1
num of rows: 1000000
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
  OPTIONAL INT64 WatchID;
  OPTIONAL INT32 JavaEnable (INTEGER(16,true));
  OPTIONAL BYTE_ARRAY Title;   <---- this is NOT annotated with String logical type

DataFusion correctly interprets this as Binary:

DataFusion CLI v42.0.0
> describe 'hits_partitioned';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | YES         |
| JavaEnable            | Int16     | YES         |
| Title                 | Binary    | YES         |  <-- BinaryArray
...

Describe the solution you'd like

I would like a way to treat binary columns in the hits_partitioned dataset as Strings.

This is the right thing to do for the hits_partitioned dataset, but I am not sure it is the right thing to do for all files, so I think we need some flag
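The intended effect of such a flag on schema inference can be sketched as follows. This is only an illustration of the behavior described above, not DataFusion's implementation: the function name `arrow_type_for_byte_array` and its parameters are hypothetical, and the type names are plain strings.

```python
from typing import Optional


def arrow_type_for_byte_array(logical_type: Optional[str],
                              binary_as_string: bool = False) -> str:
    """Hypothetical helper: pick the type reported for a Parquet BYTE_ARRAY column.

    logical_type is the Parquet annotation ("STRING") or None when the
    column carries no annotation, as in hits_partitioned.
    """
    if logical_type == "STRING":
        # Annotated columns (hits.parquet) are strings regardless of the flag.
        return "Utf8"
    # Unannotated columns stay Binary unless the reader is told otherwise.
    return "Utf8" if binary_as_string else "Binary"


# hits.parquet: Title is annotated with STRING
print(arrow_type_for_byte_array("STRING"))
# hits_partitioned: Title has no annotation; default behavior vs. the flag
print(arrow_type_for_byte_array(None))
print(arrow_type_for_byte_array(None, binary_as_string=True))
```

Keeping the default at false means files whose unannotated BYTE_ARRAY columns really hold arbitrary bytes are read exactly as they are today.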

Describe alternatives you've considered

I propose adding a binary_as_string option to the parquet reader like

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned'
OPTIONS ('binary_as_string' 'true');

In fact, it seems that DuckDB does exactly this. Note the binary_as_string parameter here:

https://duckdb.org/docs/data/parquet/overview.html#parameters

Name:        binary_as_string
Description: Parquet files generated by legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead. Set this to true to load binary columns as strings.
Type:        BOOL
Default:     false

It is also set in the ClickBench scripts:

https://github.com/ClickHouse/ClickBench/blob/a6615de8e6eae45f8c1b047df0073fe32f43499f/duckdb-parquet/create.sql#L6

Additional context

Trino also explicitly specifies the column types for the files in the hits_partitioned dataset:
https://github.com/ClickHouse/ClickBench/blob/a6615de8e6eae45f8c1b047df0073fe32f43499f/trino/create_partitioned.sql#L1-L107

We could argue that adding something just for benchmarking goes against the spirit of the test. However, I think we can make a reasonable argument on correctness grounds too.

For example, if you run this query today

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

You get this arguably nonsensical result (the search phrase is shown as binary):

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+--------------------------------------------------------------------------------------------------+-------+
| SearchEngineID | SearchPhrase                                                                                     | c     |
+----------------+--------------------------------------------------------------------------------------------------+-------+
| 2              | d0bad0b0d180d0b5d0bbd0bad0b8                                                                     | 46258 |
| 2              | d0bcd0b0d0bdd0b3d18320d0b220d0b7d0b0d180d0b0d0b1d0b5d0b920d0b3d180d0b0d0bcd0b0                   | 18871 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd                                       | 16905 |
| 3              | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd                                                     | 16748 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd20d0b1d0b5d181d0bfd0bbd0b0d182d0bdd0be | 14909 |
| 2              | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd                                                     | 13716 |
| 2              | d18dd0bad0b7d0bed0b8d0b4d0bdd18bd0b5                                                             | 13414 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c                                                                 | 13108 |
| 3              | d0bad0b0d180d0b5d0bbd0bad0b8                                                                     | 12815 |
| 2              | d0b4d180d183d0b6d0bad0b520d0bfd0bed0bcd0b5d189d0b5d0bdd0b8d0b5                                   | 11946 |
+----------------+--------------------------------------------------------------------------------------------------+-------+
10 row(s) fetched.
Elapsed 0.561 seconds.

However, if you run the same query with the proper schema (strings), you see real search phrases:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+---------------------------+-------+
| SearchEngineID | SearchPhrase              | c     |
+----------------+---------------------------+-------+
| 2              | карелки                   | 46258 |
| 2              | мангу в зарабей грама     | 18871 |
| 2              | смотреть онлайн           | 16905 |
| 3              | албатрутдин               | 16748 |
| 2              | смотреть онлайн бесплатно | 14909 |
| 2              | албатрутдин               | 13716 |
| 2              | экзоидные                 | 13414 |
| 2              | смотреть                  | 13108 |
| 3              | карелки                   | 12815 |
| 2              | дружке помещение          | 11946 |
+----------------+---------------------------+-------+
10 row(s) fetched.
Elapsed 0.569 seconds.
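The two result sets above carry the same bytes: the hex shown for the Binary column is simply the UTF-8 encoding of the search phrase. A small standard-library sketch that decodes two of the hex values copied from the first table (the helper name `decode_hex_phrase` is made up for this illustration):

```python
def decode_hex_phrase(hex_text: str) -> str:
    """Turn the hex rendering of a binary value back into a UTF-8 string."""
    return bytes.fromhex(hex_text).decode("utf-8")


# Values copied from the SearchPhrase column of the Binary result above.
print(decode_hex_phrase("d0bad0b0d180d0b5d0bbd0bad0b8"))
# prints: карелки
print(decode_hex_phrase("d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd"))
# prints: смотреть онлайн
```

Both values decode cleanly and match rows in the string result, which supports treating these columns as strings rather than opaque binary.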

Metadata

Labels

enhancement (New feature or request), help wanted (Extra attention is needed)
