[QST] Should byte_array_view in parquet reader/writer change

**What is your question?**
Should [`byte_array_view`](https://github.com/rapidsai/cudf/blob/branch-22.08/cpp/src/io/statistics/byte_array_view.cuh) change to a different implementation method, or even go away completely?

### Motivation
During review of the `byte_array_view` PR, it was brought up in [review comments](https://github.com/rapidsai/cudf/pull/11322#discussion_r928012252) that things could be done differently and possibly better. This issue is an attempt to bring the design out into the light and get some discussion going so we can build it the best way possible. Jake was, rightfully, concerned about the cognitive overload of having another object type that has to be understood, no matter how minimal the type turns out to be.

### Backstory and origin
The original thought was that it would be nice to leverage the existing templates in the statistics code to get elements and compute max/min just like everything else. This meant that `.element` on a column would be able to return a type that represents a `list<uint8>`. This is almost identical to a string column, so the thought was to have something analogous to `string_view` that could be used. This was quickly dismissed due to the issue of not having all list columns comprised of this thing and it felt like we were forcing something. All string columns are lists of chars, but not all list columns are lists of bytes.

### Requirements
The statistics code needs three things from the type: the ability to get an element from a table, to compare elements, and to compose an element from a pointer and a length. The statistics code goes to great lengths to type-erase the statistics blobs so they can be easily consumed at a large scale on the GPU, and then reconstructs them later. It also uses `thrust::min` and `cub::BlockReduce` to process them, so comparison operators are needed.
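
To make the "compose an element from a pointer and a length" requirement concrete, here is a minimal host-side sketch of the round trip through a type-erased slot. The names `byte_view`, `erased_blob`, `erase`, and `restore` are hypothetical and only illustrate the shape of the problem; they are not the types used in the statistics code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for byte_array_view: a non-owning pointer plus a length.
struct byte_view {
  const std::uint8_t* data;
  std::size_t size;
};

// Hypothetical type-erased slot: the element type is forgotten and only
// an untyped pointer and a length survive.
struct erased_blob {
  const void* ptr;
  std::size_t length;
};

// Flatten a view into the type-erased form.
inline erased_blob erase(byte_view v) { return {v.data, v.size}; }

// Compose a view back from a pointer and a length.
inline byte_view restore(erased_blob b)
{
  return {static_cast<const std::uint8_t*>(b.ptr), b.length};
}

// Usage: three bytes survive the round trip unchanged.
inline bool round_trip_example()
{
  static const std::uint8_t bytes[] = {0xde, 0xad, 0xbe};
  byte_view const in{bytes, 3};
  byte_view const out = restore(erase(in));
  return out.data == in.data && out.size == in.size;
}
```

Whatever type ends up being used has to support this kind of reconstruction, in addition to the comparison operators that the reductions rely on.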

### Slippery issues to understand
We can't use the same statistics types as strings because `string_view::max()` is not the same as a max byte or a max `byte_array_view`. The distinctions among the three are subtle but important.

- The max UTF8 string is actually just 5 bytes long and [defined inside the `string_view` header](https://github.com/rapidsai/cudf/blob/03f1c1c5c5fcf90bd594aabd41b6e15f54690777/cpp/include/cudf/strings/string_view.cuh#L75). No UTF8 string can have a higher value, so comparisons work even though it isn't an infinitely-long character string as one would initially think.
- The maximum value for an unsigned byte is obviously 255, but this isn't what is intended when one asks for the max byte array view. Instead, the goal is to know the "biggest" one. This includes the length and the internal bytes. `0xff, 0x05` is less than `0xff, 0x15` and `0xff` is less than `0x00, 0x00`.
- The maximum `byte_array_view` is defined conceptually as an infinite array of 0xff. This isn't possible to statically define for comparison like the `string_view` class, so some [magic values](https://github.com/rapidsai/cudf/blob/03f1c1c5c5fcf90bd594aabd41b6e15f54690777/cpp/src/io/statistics/byte_array_view.cuh#L173) were used instead: a `nullptr` paired with the maximum length. These then have to be [explicitly compared](https://github.com/rapidsai/cudf/blob/03f1c1c5c5fcf90bd594aabd41b6e15f54690777/cpp/src/io/statistics/byte_array_view.cuh#L101) later in the comparison function to achieve the proper results.
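
The magic-value approach from the last bullet can be sketched on the host roughly as follows. This is an illustration under assumptions, not the linked implementation: `sentinel_view`, `max_sentinel`, `is_max`, `compare`, and `min_of` are hypothetical names, and ordinary (non-sentinel) values are ordered lexicographically here purely for illustration. The linked `compare` function is authoritative for the real ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for byte_array_view.
struct sentinel_view {
  const std::uint8_t* data;
  std::size_t size;
};

constexpr std::size_t max_len = std::numeric_limits<std::size_t>::max();

// Magic value standing in for "an infinite run of 0xff": null pointer plus maximum length.
inline sentinel_view max_sentinel() { return {nullptr, max_len}; }

inline bool is_max(sentinel_view v) { return v.data == nullptr && v.size == max_len; }

// Three-way compare: negative, zero, or positive.
inline int compare(sentinel_view a, sentinel_view b)
{
  // The sentinel checks come first so the null pointer is never dereferenced.
  if (is_max(a)) { return is_max(b) ? 0 : 1; }
  if (is_max(b)) { return -1; }
  // ASSUMPTION: lexicographic ordering of ordinary values, for illustration only.
  if (std::lexicographical_compare(a.data, a.data + a.size, b.data, b.data + b.size)) { return -1; }
  if (std::lexicographical_compare(b.data, b.data + b.size, a.data, a.data + a.size)) { return 1; }
  return 0;
}

inline bool operator<(sentinel_view a, sentinel_view b) { return compare(a, b) < 0; }

// Usage: a min-reduction seeded with the sentinel, so any real value replaces it.
inline sentinel_view min_of(const sentinel_view* first, const sentinel_view* last)
{
  sentinel_view result = max_sentinel();
  for (; first != last; ++first) { result = std::min(result, *first); }
  return result;
}
```

The cost of this design is visible in `compare`: every comparison pays for two sentinel checks before it can look at the bytes, and forgetting them would dereference a null pointer.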

Lots of places required special handling for `byte_array_view`, and this could get worse with some of the possible solutions. The goal, of course, is to make these areas as clean as possible, so I thought it would be good to point some of them out here.

 - [Here](https://github.com/rapidsai/cudf/blob/03f1c1c5c5fcf90bd594aabd41b6e15f54690777/cpp/src/io/statistics/column_statistics.cuh#L115) is where the code grabs the data from the column. There is conversion in here for types, which is used for things like duration and timestamps. Originally it was thought this could be a good spot to convert from a `list_view`, which can be returned from `.element` calls on a list column. This didn't end up being a great solution, but I can't remember the details.
 - The min/max calculations and block reduce happen down in [typed_statistics_chunk](https://github.com/rapidsai/cudf/blob/03f1c1c5c5fcf90bd594aabd41b6e15f54690777/cpp/src/io/statistics/typed_statistics_chunk.cuh#L207). This code is responsible for figuring out min, max, null counts, and aggregations like sum. It has to pick up this new type and operate on it.
 - Actual data writing in parquet looks [something like this](https://github.com/rapidsai/cudf/blob/03f1c1c5c5fcf90bd594aabd41b6e15f54690777/cpp/src/io/parquet/page_enc.cu#L1094) where an element is grabbed and written into place.

### Possible solutions
1. Use `device_span` directly. This requires passing comparison functions to cub and thrust for the calculations, but is completely doable. This was [attempted](https://github.com/hyperbolic2346/cudf/tree/mwilson/test_byte_array_view_removal), potentially poorly, with not great looking results.
2. Composition vs inheritance. This came up multiple times as to why it was built with composition, holding a `device_span` inside, vs inheriting from `device_span` either publicly or privately. There isn't a great answer here to argue against inheritance. I originally thought that this would be a very small subset of `device_span` and I didn't want to muddy the waters with all the accessors and iterators, but after further inspection, I don't see anything that I would want to remove from `device_span`, so this would be a viable path. It does still hold the issue of cognitive overload of yet another type someone encounters.
3. Leave it as it is now.
4. Your amazing idea that didn't come up in development or review.
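
For reference, option 1 looks roughly like the host-side sketch below: the span type has no ordering of its own, so a comparator has to be passed at every reduction call site. `byte_span` is a hypothetical stand-in for `device_span<uint8_t const>`, `byte_span_less`, `smallest`, and `largest` are illustrative names, standard-library algorithms stand in for the cub and thrust calls, and the lexicographic ordering is an assumption made only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for device_span<uint8_t const>; it has no operator<.
struct byte_span {
  const std::uint8_t* data;
  std::size_t size;
};

// Comparator that each reduction call site would have to pass explicitly.
// ASSUMPTION: lexicographic ordering, for illustration only.
struct byte_span_less {
  bool operator()(byte_span a, byte_span b) const
  {
    return std::lexicographical_compare(a.data, a.data + a.size, b.data, b.data + b.size);
  }
};

// Precondition for both helpers: the range [first, last) is not empty.
inline byte_span smallest(const byte_span* first, const byte_span* last)
{
  return *std::min_element(first, last, byte_span_less{});
}

inline byte_span largest(const byte_span* first, const byte_span* last)
{
  return *std::max_element(first, last, byte_span_less{});
}
```

No new view type is needed this way, but the knowledge of how byte arrays are ordered moves out of the type and into every place that reduces over them.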
