Description
In #1024 we added support for handling the optional columns that may be missing due to the adaptive OTAP schema by creating placeholder columns containing entirely null/default values (depending on the column nullability).
However, this is less memory efficient than it could be because we materialize a full-length array of the native Arrow type for each missing column.
When run-end encoding (REE) is supported by Parquet, we should be able to change how the placeholder columns are generated. We can create a RunArray whose values array has length 1 instead:
// generate placeholders like:
// a length-1 values array: null if the column is nullable, otherwise a default value
let placeholder_val = if field.is_nullable() {
    UInt32Array::new_null(1)
} else {
    UInt32Array::from_iter_values([1])
};
// a single run that ends at num_rows, so the one value covers every row
let run_ends = Int32Array::from_iter_values([num_rows]);
let placeholder_arr = RunArray::try_new(&run_ends, &placeholder_val).unwrap();

We'd probably be able to reuse the length-1 values array between batches as well, so it might be worth investigating creating static instances for each column type using LazyLock.
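To make the LazyLock idea concrete, here is a minimal sketch using only the standard library. It models a run-end encoded column with a hand-rolled `ReeColumn` struct rather than the arrow-rs `RunArray`, so every name in it (`ReeColumn`, `NULL_U32`, `DEFAULT_U32`, `placeholder`) is hypothetical, and the non-nullable default of `0` is an assumption. The point is only that each batch's placeholder holds one run end plus a reference-counted pointer to a shared length-1 values array.

```rust
use std::sync::{Arc, LazyLock};

/// Simplified stand-in for a run-end encoded u32 column: `values[i]` repeats
/// for every logical row below `run_ends[i]`.
/// This models the idea only; it is not the arrow-rs `RunArray` type.
pub struct ReeColumn {
    run_ends: Vec<i32>,
    values: Arc<[Option<u32>]>,
}

// Hypothetical shared length-1 values arrays, built once on first use.
static NULL_U32: LazyLock<Arc<[Option<u32>]>> =
    LazyLock::new(|| Arc::from(vec![None]));
// Assumption: the default for a non-nullable u32 column is 0.
static DEFAULT_U32: LazyLock<Arc<[Option<u32>]>> =
    LazyLock::new(|| Arc::from(vec![Some(u32::default())]));

impl ReeColumn {
    /// Placeholder for a missing column: a single run covering `num_rows` rows.
    /// Only the `Arc` pointer is cloned; the values array itself is shared.
    pub fn placeholder(num_rows: i32, nullable: bool) -> Self {
        let values = if nullable {
            Arc::clone(&*NULL_U32)
        } else {
            Arc::clone(&*DEFAULT_U32)
        };
        ReeColumn { run_ends: vec![num_rows], values }
    }

    /// Logical number of rows (the end of the last run).
    pub fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Number of values physically stored.
    pub fn physical_len(&self) -> usize {
        self.values.len()
    }

    /// Value at a logical row. Panics if `row >= self.len()`.
    pub fn get(&self, row: usize) -> Option<u32> {
        // Index of the first run whose end is greater than `row`.
        let run = self.run_ends.partition_point(|&end| (end as usize) <= row);
        self.values[run]
    }

    /// True if both columns point at the same underlying values allocation.
    pub fn shares_values_with(&self, other: &ReeColumn) -> bool {
        Arc::ptr_eq(&self.values, &other.values)
    }
}
```

With this model, `ReeColumn::placeholder(1000, true)` reports a logical length of 1000 while storing a single value, and placeholders created for different batches point at the same allocation. In arrow-rs the analogous shared object would presumably be an `ArrayRef` per column type, but that mapping is untested here.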
Adding REE support in Parquet will need to happen first:
- Support writing RunEndEncoded as Parquet apache/arrow-rs#8069
- We'll also need to allow for RunEndEncoding to be considered a logically compatible data type in the writer, similar to what was done in [parquet] further improve logical type compatibility in ArrowWriter apache/arrow-rs#8095
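As a rough illustration of what "logically compatible" could mean for the second item, the sketch below treats a run-end encoded type as compatible with its value type by stripping the encoding wrapper before comparing. It uses a toy `ColumnType` enum, not the arrow-rs `DataType`, and `logical_type` and `logically_compatible` are hypothetical names; the actual check in ArrowWriter may be structured quite differently.

```rust
/// Toy stand-in for an Arrow-style data type (not the arrow-rs `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnType {
    UInt32,
    Utf8,
    /// Run-end encoded wrapper around a value type.
    RunEndEncoded(Box<ColumnType>),
}

/// Strip any run-end encoded wrappers to reach the logical value type.
pub fn logical_type(t: &ColumnType) -> &ColumnType {
    match t {
        ColumnType::RunEndEncoded(inner) => logical_type(inner),
        other => other,
    }
}

/// Hypothetical rule: two types are logically compatible when they agree
/// once the encoding is ignored.
pub fn logically_compatible(a: &ColumnType, b: &ColumnType) -> bool {
    logical_type(a) == logical_type(b)
}
```

Under this rule, a batch whose placeholder column is `RunEndEncoded(UInt32)` would be accepted by a writer whose schema declares that column as plain `UInt32`, and rejected if the schema declares `Utf8`.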