[QST] Should byte_array_view in parquet reader/writer change #11408

@hyperbolic2346

Description

What is your question?
Should byte_array_view change to a different implementation, or even go away completely?

Motivation

During review of the byte_array_view PR, several comments suggested that things could be done differently and possibly better. This issue is an attempt to bring the design out into the light and get some discussion going so we can build it the best way possible. Jake was, rightfully, concerned about the cognitive overload of having another object type that has to be understood, no matter how minimal the type turns out to be.

Backstory and origin

The original thought was that it would be nice to leverage the existing templates in the statistics code to get elements and compute max/min just like everything else. This meant that .element on a column would need to return a type that represents a list<uint8>. This is almost identical to a string column, so the thought was to have something analogous to string_view that could be used. This was quickly dismissed because not every list column is made up of this element type, and it felt like we were forcing something. All string columns are lists of chars, but not all list columns are lists of bytes.

Requirements

The requirements in the statistics code are the ability to get an element from a table, compare elements, and compose an element from a pointer and a length. The statistics code goes to great lengths to type-erase the statistics blobs so they can be easily consumed at a large scale on the GPU, and then reconstructs them later. It also uses thrust::min and cub::BlockReduce to process them, so comparison operators are needed.
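The three requirements can be illustrated with a small host-only C++ sketch. Everything in it is hypothetical: `bytes_sketch`, `element`, and `min_element_sketch` are illustration names rather than libcudf API, `std::min` stands in for the device-side `thrust::min`, and the ordering (length first, then bytes) is a placeholder chosen only to agree with the examples given later in this issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a byte-array element; not the libcudf type.
struct bytes_sketch {
  std::uint8_t const* ptr;
  std::size_t len;
};

// Requirement: compare elements.
// Placeholder ordering: shorter arrays sort first, equal lengths compare bytewise.
inline bool operator<(bytes_sketch const& a, bytes_sketch const& b)
{
  if (a.len != b.len) return a.len < b.len;
  return a.len != 0 && std::memcmp(a.ptr, b.ptr, a.len) < 0;
}

// Requirement: get element `i` from a list<uint8>-like column, modeled here
// as a flat byte buffer plus row offsets (offsets.size() == rows + 1).
inline bytes_sketch element(std::vector<std::uint8_t> const& bytes,
                            std::vector<std::size_t> const& offsets,
                            std::size_t i)
{
  assert(i + 1 < offsets.size());
  // Requirement: compose an element from a pointer and a length.
  return bytes_sketch{bytes.data() + offsets[i], offsets[i + 1] - offsets[i]};
}

// A sequential reduction that relies only on operator< through std::min.
// Assumes the column has at least one row.
inline bytes_sketch min_element_sketch(std::vector<std::uint8_t> const& bytes,
                                       std::vector<std::size_t> const& offsets)
{
  bytes_sketch best = element(bytes, offsets, 0);
  for (std::size_t i = 1; i + 1 < offsets.size(); ++i) {
    best = std::min(best, element(bytes, offsets, i));
  }
  return best;
}
```

The point of the sketch is that once the type carries its own `operator<`, generic reductions need no extra arguments, which is the property the existing statistics templates depend on.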

Slippery issues to understand

We can't use the same statistics types as strings because string_view::max() is actually not the same as a max byte or a max byte_array_view. The distinctions among the three are subtle but important.

  • The max UTF-8 string is actually just 5 bytes long and is defined inside the string_view header. No UTF-8 string can have a higher value, so comparisons work even though it isn't an infinitely long character string as one would initially think.
  • The maximum value for an unsigned byte is obviously 255, but this isn't what is intended when one asks for the max byte array view. Instead, the goal is to know the "biggest" one. This includes the length and the internal bytes. 0xff, 0x05 is less than 0xff, 0x15 and 0xff is less than 0x00, 0x00.
  • The maximum byte_array_view is defined conceptually as an infinite array of 0xff. Unlike the string_view case, this can't be statically defined for comparison, so magic values of a nullptr and a max length were used instead. These then have to be checked explicitly in the comparison function to achieve the proper results.
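A minimal host-only sketch of the sentinel idea follows. The names `byte_view_sketch` and `compare_sketch` are hypothetical and this is not the code from the PR. The ordering of ordinary values (length first, then bytes) is an assumption picked only because it reproduces the two examples in the list above; a bytewise lexicographic comparison would slot into the same place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical view type for illustration; not the libcudf byte_array_view.
struct byte_view_sketch {
  std::uint8_t const* ptr;
  std::size_t len;

  // An "infinite array of 0xff" cannot be stored, so it is encoded as a
  // magic pair: nullptr plus the largest representable length.
  static byte_view_sketch max()
  {
    return {nullptr, std::numeric_limits<std::size_t>::max()};
  }

  bool is_max() const
  {
    return ptr == nullptr && len == std::numeric_limits<std::size_t>::max();
  }
};

// Returns a negative value, zero, or a positive value.
inline int compare_sketch(byte_view_sketch const& a, byte_view_sketch const& b)
{
  // The sentinel is handled before any byte is read, because its pointer
  // must never be dereferenced.
  if (a.is_max() || b.is_max()) {
    return static_cast<int>(a.is_max()) - static_cast<int>(b.is_max());
  }
  // Placeholder ordering for ordinary values: length first, then bytes.
  if (a.len != b.len) return a.len < b.len ? -1 : 1;
  return a.len == 0 ? 0 : std::memcmp(a.ptr, b.ptr, a.len);
}

inline bool operator<(byte_view_sketch const& a, byte_view_sketch const& b)
{
  return compare_sketch(a, b) < 0;
}
```

Seeding a max/min reduction with `byte_view_sketch::max()` then works as an identity for "min": every real value compares less than it, and two sentinels compare equal.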

Many places required special handling for byte_array_view, and some of them could get worse under the different possible solutions. The goal, of course, is to make these areas as clean as possible, so I thought it would be good to point some of them out here.

  • Here is where the code grabs the data from the column. There is conversion in here for types, which is used for things like duration and timestamps. Originally it was thought this could be a good spot to convert from a list_view, which can be returned from .element calls on a list column. This didn't end up being a great solution, but I can't remember the details.
  • min/max calculations and the block reduce happen down in typed_statistics_chunk. This code is responsible for figuring out min, max, null counts, and aggregations like sum. It has to pick up this new type and operate on it.
  • Actual data writing in parquet looks something like this where an element is grabbed and written into place.

Possible solutions

  1. Use device_span directly. This requires passing comparison functions to cub and thrust for the calculations, but is completely doable. This was attempted, potentially poorly, and the results did not look great.
  2. Composition vs inheritance. This came up multiple times as to why it was built with composition, holding a device_span inside, vs inheriting from device_span either publicly or privately. There isn't a great answer here to argue against inheritance. I originally thought that this would be a very small subset of device_span and I didn't want to muddy the waters with all the accessors and iterators, but after further inspection, I don't see anything that I would want to remove from device_span, so this would be a viable path. It does still hold the issue of cognitive overload of yet another type someone encounters.
  3. Let it continue to live on as it is now.
  4. Your amazing idea that didn't come up in development or review.
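To make the trade-off between options 1 and 2 concrete, here is a rough host-only sketch. `span_sketch` is a hypothetical stand-in for a device_span of bytes, `std::min_element` stands in for the cub and thrust reductions, and the ordering is the same placeholder assumption (length first, then bytes) taken from the examples earlier in this issue. None of these names come from libcudf.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a span of bytes; host-only.
struct span_sketch {
  std::uint8_t const* ptr;
  std::size_t len;
  std::uint8_t const* data() const { return ptr; }
  std::size_t size() const { return len; }
};

// Placeholder ordering shared by both options: length first, then bytes.
inline bool bytes_less(span_sketch const& a, span_sketch const& b)
{
  if (a.size() != b.size()) return a.size() < b.size();
  return a.size() != 0 && std::memcmp(a.data(), b.data(), a.size()) < 0;
}

// Option 1: keep the plain span and hand a comparator to every algorithm.
struct span_less {
  bool operator()(span_sketch const& a, span_sketch const& b) const
  {
    return bytes_less(a, b);
  }
};

inline span_sketch min_with_comparator(std::vector<span_sketch> const& v)
{
  assert(!v.empty());
  return *std::min_element(v.begin(), v.end(), span_less{});
}

// Option 2: inherit from the span so the accessors come for free and the
// ordering travels with the type.
struct byte_array_sketch : span_sketch {
  byte_array_sketch(std::uint8_t const* p, std::size_t n) : span_sketch{p, n} {}
};

inline bool operator<(byte_array_sketch const& a, byte_array_sketch const& b)
{
  return bytes_less(a, b);
}

inline byte_array_sketch min_with_type(std::vector<byte_array_sketch> const& v)
{
  assert(!v.empty());
  return *std::min_element(v.begin(), v.end());
}
```

With option 1, every call site has to remember to pass `span_less`; with option 2 the call sites stay unchanged, at the cost of one more named type for readers to learn.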

Labels

cuIO, libcudf, question