Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Jan 7, 2026

Which issue does this PR close?

Rationale for this change

The current implementation of make_array for StructArray and GenericByteViewArray clones ArrayData which allocates a new Vec. This is unnecessary given that make_array is passed an owned ArrayData

What changes are included in this PR?

  1. Add a new API to ArrayData to break it down into parts (into_parts)
  2. Use that API to avoid cloning while constructing StructArray and GenericByteViewArray

Are these changes tested?

Yes by CI

Are there any user-facing changes?

A few fewer allocations when creating arrays

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 7, 2026
fn from(value: ArrayData) -> Self {
let views = value.buffers()[0].clone();
let views = ScalarBuffer::new(views, value.offset(), value.len());
let buffers = value.buffers()[1..].to_vec();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call to_vec() allocates a new Vec which is unecessary


impl<T: ByteViewType + ?Sized> From<ArrayData> for GenericByteViewArray<T> {
fn from(value: ArrayData) -> Self {
let views = value.buffers()[0].clone();
Copy link
Contributor Author

@alamb alamb Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cloneing the buffers is relatively cheap (they are Arcd internally) so avoiding this just makes the code easier to follow, I don't think it will be any significant performance savings

make_array(cd.slice(parent_offset, parent_len))
} else {
make_array(cd.clone())
make_array(cd)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cd.clone() clones the ArrayData (and allocates a new Vec) for each child, recursively, which is unecessary

.into_iter()
.map(|cd| {
if parent_offset != 0 || parent_len != cd.len() {
make_array(cd.slice(parent_offset, parent_len))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably avoid an additional allocation for sliced arrays by making a version of slice() that consumes self -- like sliced() perhaps 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor Author

alamb commented Jan 9, 2026

Thanks @mhilton

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one nit

@alamb
Copy link
Contributor Author

alamb commented Jan 11, 2026

BTW I plan to make the same corresponding change for other array types (see #9058) but I was making multiple PRs to reduce the review load

@alamb alamb merged commit 237065b into apache:main Jan 12, 2026
26 checks passed
@scovich
Copy link
Contributor

scovich commented Jan 12, 2026

@alamb -- post-merge reply to a resolved comment here (in case github didn't make it easy to find):
#9114 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants