[parquet exporter] use more memory efficient types when creating placeholder columns #1034

@albertlockett

Description

In #1024 we added support for handling the optional columns that may be missing due to the adaptive OTAP schema by creating placeholder columns containing entirely null/default values (depending on the column nullability).

However, this is less memory efficient than it could be because we materialize a full-length array of the native Arrow type for each missing column.

When run-end encoding (REE) is supported by Parquet, we should be able to change how the placeholder columns are generated. We can create a RunArray with a single run (a values array of length 1) instead:

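use arrow::array::{Int32Array, RunArray, UInt32Array};

// generate placeholders like:
// a single value (null or default) that stands in for the whole column
let placeholder_val = if field.is_nullable() {
    UInt32Array::new_null(1)
} else {
    UInt32Array::from_iter_values([1])
};
// one run covering every row in the batch; run ends are i32 here
let run_ends = Int32Array::from_iter_values([num_rows as i32]);
let placeholder_arr = RunArray::try_new(&run_ends, &placeholder_val).unwrap();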

We'd probably be able to reuse the length-1 values array between batches as well, so it might be worth investigating creating static instances for each column type using LazyLock.
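The reuse idea can be sketched roughly like this. This is a std-only model, not the arrow-rs API: `RunEncodedU32`, `NULL_U32`, `DEFAULT_U32`, and `placeholder_u32` are hypothetical names, and `0` is an assumed default value. In the real exporter the shared value would presumably be an Arrow array rather than an `Arc<[Option<u32>]>`. The point is only that the length-1 values buffer is built once and each batch allocates nothing but its single run end.

```rust
use std::sync::{Arc, LazyLock};

/// Std-only stand-in for a run-end encoded column: `values[i]` repeats
/// up to (but not including) logical row `run_ends[i]`. Not an Arrow type.
#[derive(Debug, Clone)]
struct RunEncodedU32 {
    run_ends: Vec<i32>,
    values: Arc<[Option<u32>]>,
}

impl RunEncodedU32 {
    /// Logical number of rows, i.e. the last run end.
    fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Logical value at `row`; panics if `row >= self.len()`.
    fn get(&self, row: usize) -> Option<u32> {
        // Index of the first run whose end lies past `row`.
        let run = self.run_ends.partition_point(|&end| (end as usize) <= row);
        self.values[run]
    }
}

// Length-1 value buffers, built on first access and shared by every batch.
static NULL_U32: LazyLock<Arc<[Option<u32>]>> =
    LazyLock::new(|| Arc::from(vec![None]));
static DEFAULT_U32: LazyLock<Arc<[Option<u32>]>> =
    LazyLock::new(|| Arc::from(vec![Some(0)])); // assumed default value

/// Hypothetical helper: only `run_ends` is allocated per batch.
fn placeholder_u32(nullable: bool, num_rows: i32) -> RunEncodedU32 {
    let values = if nullable {
        Arc::clone(&*NULL_U32)
    } else {
        Arc::clone(&*DEFAULT_U32)
    };
    RunEncodedU32 {
        run_ends: vec![num_rows],
        values,
    }
}
```

Two placeholders built for different batch sizes point at the same values allocation, so the per-batch cost is one `i32` run end regardless of `num_rows`.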

Adding REE support in Parquet will need to happen first:

Metadata

Assignees

No one assigned

    Labels

    apache-arrow: Low level Apache Arrow tasks
    parquet-exporter: Parquet Exporter related tasks
    perf-opt: Performance and optimization
    rust: Pull requests that update Rust code

    Type

    No type

    Projects

    Status

    Priority 3

    Relationships

    None yet

    Development

    No branches or pull requests
