[parquet exporter] use more memory efficient types when creating placeholder columns #1034

In #1024 we added support for handling the optional columns that may be missing due to the adaptive OTAP schema by creating placeholder columns containing entirely null/default values (depending on the column nullability).

However, this is less memory efficient than it could be because we materialize a full-length array of the native Arrow type for each missing column.
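For a rough sense of scale, the sketch below compares approximate buffer sizes for a `UInt32` placeholder. It is a back-of-the-envelope estimate only: it ignores buffer padding, alignment, and per-array struct overhead, and the function names are hypothetical.

```rust
/// Approximate buffer bytes for a full-length UInt32 placeholder:
/// 4 bytes per value plus a 1-bit-per-row validity bitmap when nullable.
fn full_length_bytes(num_rows: usize, nullable: bool) -> usize {
    let values = num_rows * 4;
    let validity = if nullable { num_rows.div_ceil(8) } else { 0 };
    values + validity
}

/// Approximate buffer bytes for a single-run, run-end encoded placeholder:
/// one i32 run end, one u32 value, and at most one validity byte.
fn single_run_bytes(nullable: bool) -> usize {
    let run_ends = 4;
    let values = 4;
    let validity = if nullable { 1 } else { 0 };
    run_ends + values + validity
}
```

Under these assumptions, a nullable 8192-row batch needs about 33 KiB per missing column when fully materialized, versus 9 bytes for the single-run form.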

Once REE (run-end encoding) is supported by the Parquet writer, we should be able to change how the placeholder columns are generated. Instead of a full-length array, we can create a `RunArray` with a single run, backed by a one-element values array:
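```rs
use arrow::array::{ArrayRef, Int32Array, RunArray, UInt32Array};
use std::sync::Arc;

// generate placeholders like:
// a single value: null for nullable fields, otherwise the default (0)
let placeholder_val: ArrayRef = if field.is_nullable() {
    Arc::new(UInt32Array::new_null(1))
} else {
    Arc::new(UInt32Array::from_iter_values([0]))
};
// one run covering every row in the batch (`num_rows` must fit in an i32)
let run_ends = Int32Array::from_iter_values([num_rows as i32]);
let placeholder_arr = RunArray::try_new(&run_ends, placeholder_val.as_ref()).unwrap();
```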
We'd probably be able to reuse the single-element values array between batches as well, so it might be worth investigating creating static instances for each column type using `LazyLock`.
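The caching idea could look roughly like the following. This is a std-only sketch, not the exporter's code: `Values`, `Placeholder`, and `placeholder` are hypothetical stand-ins for the Arrow values array, the `RunArray`, and whatever helper would build it. The point is only that each batch allocates a new run-ends buffer while the one-element values are shared.

```rust
use std::sync::{Arc, LazyLock};

/// Stand-in for a one-element Arrow values array (hypothetical; real code
/// would hold an Arrow array here instead).
type Values = Arc<[Option<u32>]>;

/// Shared one-element values, built on first use and reused afterwards.
static NULL_U32: LazyLock<Values> = LazyLock::new(|| Arc::from(vec![None::<u32>]));
static DEFAULT_U32: LazyLock<Values> = LazyLock::new(|| Arc::from(vec![Some(0u32)]));

/// Simplified model of a run-end encoded placeholder column.
struct Placeholder {
    run_ends: Vec<i32>,
    values: Values,
}

impl Placeholder {
    /// Logical length: the last run end.
    fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Logical value at `row`. Panics if `row` is out of bounds.
    fn get(&self, row: usize) -> Option<u32> {
        // Index of the first run whose end lies past `row`.
        let run = self.run_ends.partition_point(|&end| (end as usize) <= row);
        self.values[run]
    }
}

/// Build a placeholder for one batch; only `run_ends` is allocated per call.
fn placeholder(nullable: bool, num_rows: i32) -> Placeholder {
    let values = if nullable {
        Arc::clone(&*NULL_U32)
    } else {
        Arc::clone(&*DEFAULT_U32)
    };
    Placeholder { run_ends: vec![num_rows], values }
}
```

Cloning the `Arc` only bumps a reference count, so two placeholders built for different batches point at the same values allocation.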

Adding REE support in Parquet will need to happen first:
- [ ] https://github.com/apache/arrow-rs/pull/8069
- [ ] We'll also need to allow `RunEndEncoded` to be considered a logically compatible data type in the writer, similar to what was done in https://github.com/apache/arrow-rs/pull/8095

