Performance: Add "read binary as strings" option for parquet

TLDR: I would like to add a new `binary_as_string` option for Parquet:

```sql
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned'
OPTIONS ('binary_as_string' 'true');
```

### Is your feature request related to a problem or challenge?

## The Real Problem

The primary problem is that the ClickBench queries [slow down when we enable StringView by default](https://github.com/apache/datafusion/issues/11682), but only for the `hits_partitioned` version of the dataset.

One of the reasons is that reading a column as a `BinaryViewArray` and then casting it to a `StringViewArray` is significantly slower than reading the data from Parquet directly as a `StringViewArray` (due to optimizations in the Parquet decoder).

This is not a problem for `BinaryArray` --> `StringArray`, because reading a column as a `BinaryArray` and then casting it to a `StringArray` is about the same speed as reading it directly as a `StringArray`.

The core issue is that for `hits_partitioned` the "String" columns in the schema are marked as binary (not Utf8) and thus the slower conversion path is used.

## The background: `hits_partitioned` has "string" columns marked as "Binary"
ClickBench has two versions of the Parquet dataset (see [docs here](https://github.com/ClickHouse/ClickBench?tab=readme-ov-file#data-loading)):
1. `hits.parquet` (a single 14G parquet file)
2. `athena_partitioned/hits_{0..99}.parquet`

However, the schema differs between these two versions.

### `hits.parquet` has `String`s:

```
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ parquet-schema hits.parquet
Metadata for file: hits.parquet

version: 1
num of rows: 99997497
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
  REQUIRED INT64 WatchID;
  REQUIRED INT32 JavaEnable (INTEGER(16,true));
  REQUIRED BYTE_ARRAY Title (STRING);    <---  Annotated with "(String)" logical type
```

DataFusion recognizes this as `Utf8`:
```sql
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | NO          |
| JavaEnable            | Int16     | NO          |
| Title                 | Utf8      | NO          | <-- StringArray!
...
```

### `hits_partitioned` has the string columns as `Binary`:

```
$ parquet-schema hits_partitioned/hits_22.parquet
Metadata for file: hits_partitioned/hits_22.parquet

version: 1
num of rows: 1000000
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
  OPTIONAL INT64 WatchID;
  OPTIONAL INT32 JavaEnable (INTEGER(16,true));
  OPTIONAL BYTE_ARRAY Title;   <---- this is NOT annotated with String logical type
```

DataFusion correctly interprets this as `Binary`:

```sql
DataFusion CLI v42.0.0
> describe 'hits_partitioned';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | YES         |
| JavaEnable            | Int16     | YES         |
| Title                 | Binary    | YES         |  <-- BinaryArray
...
```

### Describe the solution you'd like

I would like a way to treat the binary columns in the `hits_partitioned` dataset as strings.

This is the right thing to do for the `hits_partitioned` dataset, but I am not sure it is the right thing to do for all files, so I think we need a flag.
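
One reason a flag seems safer than an unconditional conversion: an unannotated `BYTE_ARRAY` column is not guaranteed to hold valid UTF-8. The Python sketch below is illustrative only (`is_valid_utf8` is a hypothetical helper, not a DataFusion API, and the sample byte values are made up); it shows the difference between bytes that happen to be text and bytes that are not:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes can be interpreted as UTF-8 text."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes that really are text (as in the hits_partitioned string columns)
text_bytes = "смотреть онлайн".encode("utf-8")

# Arbitrary binary payload (made-up example); 0xff never appears in valid UTF-8
blob_bytes = b"\xff\xfe\x00\x01"

print(is_valid_utf8(text_bytes))
print(is_valid_utf8(blob_bytes))
```

A file whose binary columns look like `blob_bytes` could not be read as strings, which is why the behavior probably needs to be opt-in.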



### Describe alternatives you've considered

I propose adding a `binary_as_string` option to the Parquet reader, for example:

```sql
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned'
OPTIONS ('binary_as_string' 'true');
```

In fact, it seems DuckDB does exactly this; note the `binary_as_string=True` parameter here:

https://duckdb.org/docs/data/parquet/overview.html#parameters


Name | Description | Type | Default
-- | -- | -- | --
binary_as_string | Parquet files generated by legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead. Set this to true to load binary columns as strings. | BOOL | false

It is also set in the ClickBench scripts:

https://github.com/ClickHouse/ClickBench/blob/a6615de8e6eae45f8c1b047df0073fe32f43499f/duckdb-parquet/create.sql#L6
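
The semantics I have in mind for the option can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the actual DataFusion or parquet reader code: the function name `arrow_type_for_byte_array` and its parameters are hypothetical, and only the `STRING` annotation shown in the `parquet-schema` output above is modeled.

```python
from typing import Optional


def arrow_type_for_byte_array(
    logical_type: Optional[str], binary_as_string: bool = False
) -> str:
    """Hypothetical model of how a Parquet BYTE_ARRAY column is typed.

    logical_type: the annotation on the column ("STRING"), or None if the
                  column is not annotated.
    binary_as_string: the proposed option; when True, unannotated
                      BYTE_ARRAY columns are treated as strings.
    """
    if logical_type == "STRING":
        # Annotated columns are already read as strings (hits.parquet)
        return "Utf8"
    # Unannotated columns (hits_partitioned) depend on the proposed flag
    return "Utf8" if binary_as_string else "Binary"


print(arrow_type_for_byte_array("STRING"))                    # hits.parquet
print(arrow_type_for_byte_array(None))                        # current behavior
print(arrow_type_for_byte_array(None, binary_as_string=True)) # proposed
```

The default stays `False`, so files that genuinely contain binary data keep their current type.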


### Additional context

Trino also explicitly specifies the column types for the files in the `hits_partitioned` dataset:
https://github.com/ClickHouse/ClickBench/blob/a6615de8e6eae45f8c1b047df0073fe32f43499f/trino/create_partitioned.sql#L1-L107

One could argue that adding something just for benchmarking goes against the spirit of the test. However, I think we can make a reasonable argument on correctness grounds too.


For example, if you run this query today:

```sql
SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;
```

You get this arguably nonsensical result (the search phrase is shown as binary):

```sql
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+--------------------------------------------------------------------------------------------------+-------+
| SearchEngineID | SearchPhrase                                                                                     | c     |
+----------------+--------------------------------------------------------------------------------------------------+-------+
| 2              | d0bad0b0d180d0b5d0bbd0bad0b8                                                                     | 46258 |
| 2              | d0bcd0b0d0bdd0b3d18320d0b220d0b7d0b0d180d0b0d0b1d0b5d0b920d0b3d180d0b0d0bcd0b0                   | 18871 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd                                       | 16905 |
| 3              | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd                                                     | 16748 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd20d0b1d0b5d181d0bfd0bbd0b0d182d0bdd0be | 14909 |
| 2              | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd                                                     | 13716 |
| 2              | d18dd0bad0b7d0bed0b8d0b4d0bdd18bd0b5                                                             | 13414 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c                                                                 | 13108 |
| 3              | d0bad0b0d180d0b5d0bbd0bad0b8                                                                     | 12815 |
| 2              | d0b4d180d183d0b6d0bad0b520d0bfd0bed0bcd0b5d189d0b5d0bdd0b8d0b5                                   | 11946 |
+----------------+--------------------------------------------------------------------------------------------------+-------+
10 row(s) fetched.
Elapsed 0.561 seconds.
```
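
The hex strings above are simply the UTF-8 bytes of the search phrases. As a quick sanity check, here is a minimal Python sketch (separate from DataFusion; the helper name `decode_hex_phrase` is made up for illustration) that decodes the first row by hand:

```python
def decode_hex_phrase(hex_text: str) -> str:
    """Interpret a hex dump (as printed for Binary columns) as UTF-8 text."""
    return bytes.fromhex(hex_text).decode("utf-8")


# First SearchPhrase value from the output above
print(decode_hex_phrase("d0bad0b0d180d0b5d0bbd0bad0b8"))  # карелки
```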

However, if you run the same query with the proper schema (strings), you see real search phrases:

```sql
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+---------------------------+-------+
| SearchEngineID | SearchPhrase              | c     |
+----------------+---------------------------+-------+
| 2              | карелки                   | 46258 |
| 2              | мангу в зарабей грама     | 18871 |
| 2              | смотреть онлайн           | 16905 |
| 3              | албатрутдин               | 16748 |
| 2              | смотреть онлайн бесплатно | 14909 |
| 2              | албатрутдин               | 13716 |
| 2              | экзоидные                 | 13414 |
| 2              | смотреть                  | 13108 |
| 3              | карелки                   | 12815 |
| 2              | дружке помещение          | 11946 |
+----------------+---------------------------+-------+
10 row(s) fetched.
Elapsed 0.569 seconds.
```

