[FEA] Add variable bit-width keys and improved key order for Parquet dict pages #13995

@abellina

Original title:
"[FEA] research enabling BIT_PACKED encoding for columns in parquet writer"

We have a user report that Parquet files written on the GPU are larger than those written by Spark on the CPU. With their sample data, I see a GPU file roughly 30% larger than the CPU file. I have matched the output layout (a single row group and the same number of files on both sides), so the remaining difference comes down to column encodings. All columns are nullable INT64.

One difference between the two files is that cuDF does not use the BIT_PACKED encoding for these columns:

CPU (ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY):

col1:         INT64 SNAPPY DO:4 FPO:508942 SZ:856794/1169887/1.37 VC:822216 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1237, max: 1234559, num_nulls: 0]
col2:         INT64 SNAPPY DO:856798 FPO:1365736 SZ:856794/1169887/1.37 VC:822216 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1237, max: 1234559, num_nulls: 0]

GPU (ENC:PLAIN_DICTIONARY,RLE):

col1:         INT64 SNAPPY DO:4 FPO:509620 SZ:924686/1234683/1.34 VC:822216 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1237, max: 1234559, num_nulls: 0]
col2:         INT64 SNAPPY DO:924690 FPO:1434330 SZ:924742/1234683/1.34 VC:822216 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1237, max: 1234559, num_nulls: 0]
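To put a number on the gap between the two listings above, here is a minimal sketch that pulls the `SZ:` field out of each col1 line and compares the sizes. It assumes the field is laid out as compressed/uncompressed/ratio, which is consistent with the values shown; the helper name `parse_sz` is made up for illustration.

```python
import re


def parse_sz(line):
    """Return (compressed, uncompressed) bytes from an 'SZ:a/b/ratio' field.

    Assumes the field order is compressed/uncompressed/ratio.
    """
    match = re.search(r"SZ:(\d+)/(\d+)/", line)
    if match is None:
        raise ValueError("no SZ field found")
    return int(match.group(1)), int(match.group(2))


# SZ fields copied from the col1 lines of the CPU and GPU listings.
cpu_compressed, cpu_uncompressed = parse_sz("SZ:856794/1169887/1.37")
gpu_compressed, gpu_uncompressed = parse_sz("SZ:924686/1234683/1.34")

# Relative size increase of the GPU column chunk over the CPU one.
compressed_overhead = gpu_compressed / cpu_compressed - 1
uncompressed_overhead = gpu_uncompressed / cpu_uncompressed - 1

print(f"compressed:   {compressed_overhead:.1%} larger on GPU")
print(f"uncompressed: {uncompressed_overhead:.1%} larger on GPU")
```

For col1 this works out to roughly 7.9% compressed and 5.5% uncompressed, so the gap is already present before SNAPPY is applied.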

When I discussed this with @nvdbaranec, he suggested that BIT_PACKED could be enough to explain the difference. I generated files from my own mock data (just sequences of longs) and encoded them with both the CPU and the GPU. Two of the generated files are in this zip file:

bit_packed_example.zip

I would appreciate any comments. If you want me to try a small change in cuDF and rebuild/retest, I am happy to do so.
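As a rough illustration of why the bit width of dictionary keys matters for file size, the sketch below estimates the bytes needed to bit-pack the dictionary indices of one column. It is not cuDF's or parquet-mr's implementation: the dictionary size of 2000 entries is a made-up number, run headers and page overhead are ignored, and the 16-bit comparison point is only a hypothetical example of a coarser key width. The value count 822216 is taken from the `VC:` field above.

```python
def index_bit_width(dict_size):
    """Smallest bit width that can address dict_size dictionary entries."""
    if dict_size < 1:
        raise ValueError("dictionary must have at least one entry")
    return (dict_size - 1).bit_length()


def packed_bytes(num_values, bit_width):
    """Bytes needed to store num_values indices at bit_width bits each."""
    return (num_values * bit_width + 7) // 8


num_values = 822216  # VC from the listings above
dict_size = 2000     # hypothetical: the real dictionary size is not reported

minimal_width = index_bit_width(dict_size)
minimal_size = packed_bytes(num_values, minimal_width)

# Hypothetical comparison: the same indices stored with 16-bit keys.
coarse_size = packed_bytes(num_values, 16)

print(f"minimal width: {minimal_width} bits, {minimal_size} bytes")
print(f"16-bit keys:   {coarse_size} bytes")
```

Under these assumptions, every extra bit per key costs about 100 KB for this column, so a writer that picks a wider key than necessary can produce noticeably larger pages even with identical dictionaries.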

Labels: 0 - Backlog, Spark, cuIO, feature request, libcudf

Status: To be revisited