Closed
Labels
enhancement, parquet, performance
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I ran into this while working on the benchmark here
I noticed that a substantial amount of time (15% of the overall time) in the benchmark was spent in convert_row_groups:
arrow-rs/parquet/src/file/metadata/thrift_gen.rs
Lines 247 to 257 in b4b4d26
    fn convert_row_groups(
        mut row_groups: Vec<RowGroup>,
        schema_descr: Arc<SchemaDescriptor>,
    ) -> Result<Vec<RowGroupMetaData>> {
        let mut res: Vec<RowGroupMetaData> = Vec::with_capacity(row_groups.len());
        for rg in row_groups.drain(0..) {
            res.push(convert_row_group(rg, schema_descr.clone())?);
        }
        Ok(res)
    }

Describe the solution you'd like
I think that code could likely be optimized.
Describe alternatives you've considered
Two obvious candidates:
- Use the into_iter()/collect pattern to map the results (which is highly optimized in Rust)
- Don't clone the Arc<SchemaDescriptor>; I think it only needs a reference
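A rough sketch of what the two candidates could look like together. The types below are simplified, hypothetical stand-ins for the parquet crate's RowGroup, RowGroupMetaData, and SchemaDescriptor (one or two fields each), and the column-count check is invented purely so the conversion can fail. If the converted value has to keep its own handle to the schema, one Arc clone per row group would still be needed inside convert_row_group.

```rust
use std::sync::Arc;

// Hypothetical, heavily simplified stand-ins for the parquet crate's types.
#[derive(Debug)]
struct SchemaDescriptor {
    num_columns: usize,
}

#[derive(Debug)]
struct RowGroup {
    num_rows: i64,
    num_columns: usize,
}

#[derive(Debug, PartialEq)]
struct RowGroupMetaData {
    num_rows: i64,
}

type Result<T> = std::result::Result<T, String>;

// Borrows the schema instead of taking an owned Arc, so the caller does not
// have to clone it for every row group.
fn convert_row_group(
    rg: RowGroup,
    schema_descr: &Arc<SchemaDescriptor>,
) -> Result<RowGroupMetaData> {
    // Invented validation step, only here to give the function an error path.
    if rg.num_columns != schema_descr.num_columns {
        return Err(format!(
            "row group has {} columns, schema has {}",
            rg.num_columns, schema_descr.num_columns
        ));
    }
    Ok(RowGroupMetaData { num_rows: rg.num_rows })
}

// Consumes the input Vec with into_iter() and collects into Result<Vec<_>>,
// which stops at the first error, like `?` inside the original loop.
fn convert_row_groups(
    row_groups: Vec<RowGroup>,
    schema_descr: &Arc<SchemaDescriptor>,
) -> Result<Vec<RowGroupMetaData>> {
    row_groups
        .into_iter()
        .map(|rg| convert_row_group(rg, schema_descr))
        .collect()
}
```

Whether this is measurably faster than the drain loop would need to be checked against the benchmark; the sketch only shows the shape of the change.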
Another option would be to decode directly into RowGroupMetaData somehow (maybe make RowGroupMetaData a view on an inner RowGroup 🤔):
    struct RowGroupMetaData {
        inner: RowGroup,
    }
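To make the "view" idea a bit more concrete, here is a minimal sketch under stated assumptions: RowGroup is a hypothetical two-field stand-in for the Thrift-decoded struct, and the accessor names are illustrative rather than taken from the crate. The wrapper takes ownership of the decoded value and reads fields on demand, so conversion is a move rather than a field-by-field copy.

```rust
// Hypothetical stand-in for the Thrift-decoded row group struct.
#[derive(Debug)]
struct RowGroup {
    num_rows: i64,
    total_byte_size: i64,
}

// Wrapper that owns the decoded value and exposes it through accessors.
#[derive(Debug)]
struct RowGroupMetaData {
    inner: RowGroup,
}

impl RowGroupMetaData {
    // Accessors read straight from the wrapped struct.
    fn num_rows(&self) -> i64 {
        self.inner.num_rows
    }

    fn total_byte_size(&self) -> i64 {
        self.inner.total_byte_size
    }
}

// Conversion just moves the decoded struct into the wrapper.
impl From<RowGroup> for RowGroupMetaData {
    fn from(inner: RowGroup) -> Self {
        Self { inner }
    }
}
```

The trade-off is that any derived state the real RowGroupMetaData carries (for example, a handle to the schema) would have to live next to inner or be computed lazily.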
Additional context