
[thrift-remodel] Optimize convert_row_groups #8517

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I ran into this while working on the benchmark here

I noticed that a substantial amount of the benchmark's time (15% of the overall time) was spent in convert_row_groups:

    fn convert_row_groups(
        mut row_groups: Vec<RowGroup>,
        schema_descr: Arc<SchemaDescriptor>,
    ) -> Result<Vec<RowGroupMetaData>> {
        let mut res: Vec<RowGroupMetaData> = Vec::with_capacity(row_groups.len());
        for rg in row_groups.drain(0..) {
            res.push(convert_row_group(rg, schema_descr.clone())?);
        }
        Ok(res)
    }


Describe the solution you'd like
I think that code could likely be optimized.

Describe alternatives you've considered
Two obvious candidates:

  1. Use the into_iter() / collect pattern to map the results (which is highly optimized in Rust)
  2. Don't clone the Arc<SchemaDescriptor> -- I think it only needs a reference
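A minimal sketch of both candidates combined, under stated assumptions: `RowGroup`, `RowGroupMetaData`, `SchemaDescriptor`, and the body of `convert_row_group` below are simplified, hypothetical stand-ins, not the parquet crate's real definitions. It also assumes the per-row-group conversion only reads the schema. If the real `RowGroupMetaData` has to own an `Arc<SchemaDescriptor>`, one clone per row group would still be needed.

```rust
// Hypothetical stand-ins for the real types, reduced to a few fields.
#[derive(Debug)]
struct SchemaDescriptor {
    num_columns: usize,
}

#[derive(Debug)]
struct RowGroup {
    num_rows: i64,
    column_sizes: Vec<i64>,
}

#[derive(Debug)]
struct RowGroupMetaData {
    num_rows: i64,
    total_byte_size: i64,
}

type Result<T> = std::result::Result<T, String>;

// Stand-in conversion: it only reads the schema, so a plain reference is enough.
fn convert_row_group(rg: RowGroup, schema_descr: &SchemaDescriptor) -> Result<RowGroupMetaData> {
    if rg.column_sizes.len() != schema_descr.num_columns {
        return Err(format!(
            "expected {} columns, found {}",
            schema_descr.num_columns,
            rg.column_sizes.len()
        ));
    }
    Ok(RowGroupMetaData {
        num_rows: rg.num_rows,
        total_byte_size: rg.column_sizes.iter().sum(),
    })
}

// into_iter() consumes the input Vec; collecting into Result<Vec<_>>
// stops at the first error, like the `?` in the original loop.
fn convert_row_groups(
    row_groups: Vec<RowGroup>,
    schema_descr: &SchemaDescriptor,
) -> Result<Vec<RowGroupMetaData>> {
    row_groups
        .into_iter()
        .map(|rg| convert_row_group(rg, schema_descr))
        .collect()
}
```

Whether this is actually faster than the `drain` loop would need to be confirmed with the benchmark.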

Another option would be to decode directly into RowGroupMetaData somehow (maybe make RowGroupMetaData a view on an inner RowGroup 🤔):

struct RowGroupMetaData {
  inner: RowGroup
}
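To illustrate the view idea, here is a small sketch in which the wrapper takes ownership of the decoded struct and exposes it through accessors, so no per-field conversion pass is needed. The fields on `RowGroup` and the accessor names are assumptions chosen for illustration and may not match the real types.

```rust
// Hypothetical stand-in for the Thrift-decoded struct.
#[derive(Debug)]
struct RowGroup {
    num_rows: i64,
    total_byte_size: i64,
}

// The metadata type wraps the decoded struct instead of copying its fields.
#[derive(Debug)]
struct RowGroupMetaData {
    inner: RowGroup,
}

impl RowGroupMetaData {
    // Construction is just a move of the decoded struct.
    fn new(inner: RowGroup) -> Self {
        Self { inner }
    }

    // Accessors read straight from the inner struct.
    fn num_rows(&self) -> i64 {
        self.inner.num_rows
    }

    fn total_byte_size(&self) -> i64 {
        self.inner.total_byte_size
    }
}
```

With this shape, converting a `Vec<RowGroup>` reduces to wrapping each element, for example `row_groups.into_iter().map(RowGroupMetaData::new).collect::<Vec<_>>()`.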

Additional context

Labels

enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate), performance
