Enhance ListViewArray related docs #7007

Merged
merged 2 commits into from
Jan 24, 2025
7 changes: 6 additions & 1 deletion arrow-array/src/array/list_array.rs
@@ -55,7 +55,9 @@ impl OffsetSizeTrait for i64 {
}

/// An array of [variable length lists], similar to JSON arrays
-/// (e.g. `["A", "B", "C"]`).
/// (e.g. `["A", "B", "C"]`). This struct specifically represents
/// the [list layout]. Refer to [`GenericListViewArray`] for the
/// [list-view layout].
///
/// Lists are represented using `offsets` into a `values` child
/// array. Offsets are stored in two adjacent entries of an
@@ -123,7 +125,10 @@ impl OffsetSizeTrait for i64 {
/// ```
///
/// [`StringArray`]: crate::array::StringArray
/// [`GenericListViewArray`]: crate::array::GenericListViewArray
/// [variable length lists]: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
/// [list layout]: https://arrow.apache.org/docs/format/Columnar.html#list-layout
/// [list-view layout]: https://arrow.apache.org/docs/format/Columnar.html#listview-layout
pub struct GenericListArray<OffsetSize: OffsetSizeTrait> {
data_type: DataType,
nulls: Option<NullBuffer>,
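
To make the relationship between the two layouts concrete, here is a minimal sketch in plain Rust (no arrow crates). The helper name `offsets_to_views` is hypothetical, and the offsets are an assumption chosen to match the example array used in this PR's list-view diagram (`[A,B,C]`, `[]`, `NULL`, `[D]`, `[NULL, F]`), with the null row covering one placeholder child slot. It shows that in the list layout each list's size is derived from two adjacent offsets, which is the information the list-view layout stores explicitly in its sizes buffer.

```rust
/// Hypothetical helper (not part of arrow-rs): turn list-layout offsets
/// into the explicit (offset, size) pairs that a list-view layout stores.
fn offsets_to_views(offsets: &[i32]) -> Vec<(i32, i32)> {
    // Element i spans offsets[i]..offsets[i + 1], so its size is the difference.
    offsets.windows(2).map(|w| (w[0], w[1] - w[0])).collect()
}

fn main() {
    // Assumed list-layout offsets for [A,B,C], [], NULL, [D], [NULL, F]
    // over 7 child slots; the NULL row covers one placeholder slot here.
    let offsets = [0, 3, 3, 4, 5, 7];
    let views = offsets_to_views(&offsets);
    assert_eq!(views, vec![(0, 3), (3, 0), (3, 1), (4, 1), (5, 2)]);
    println!("{:?}", views);
}
```

The reverse direction is not always possible: a list-view whose offsets are out of order or overlapping has no equivalent set of monotonically increasing list-layout offsets over the same child array.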
71 changes: 68 additions & 3 deletions arrow-array/src/array/list_view_array.rs
@@ -32,16 +32,81 @@ pub type ListViewArray = GenericListViewArray<i32>;
/// A [`GenericListViewArray`] of variable size lists, storing offsets as `i64`.
pub type LargeListViewArray = GenericListViewArray<i64>;

/// An array of [variable length lists], specifically in the [list-view layout].
///
-/// Different from [`crate::GenericListArray`] as it stores both an offset and length
-/// meaning that take / filter operations can be implemented without copying the underlying data.
/// Differs from [`GenericListArray`] (which represents the [list layout]) in that
/// the sizes of the child arrays are explicitly encoded in a separate buffer, instead
/// of being derived from the difference between subsequent offsets in the offset buffer.
///
-/// [Variable-size List Layout: ListView Layout]: https://arrow.apache.org/docs/format/Columnar.html#listview-layout
/// This allows the offsets (and subsequently child data) to be out of order. It also
/// allows take / filter operations to be implemented without copying the underlying data.

Review comment (Contributor Author): Would be nice to elaborate on this statement about take / filter operations efficiency; I just kept it verbatim as it was already there before.
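
One way to read the take / filter statement, sketched in plain Rust rather than with the arrow-rs API: because each row carries its own offset and size, selecting rows only needs to gather those pairs, and the child values can be shared as they are. The `ListView` struct and the `take_views` and `resolve` helpers are hypothetical illustrations, not a description of how arrow-rs implements its kernels.

```rust
/// Hypothetical, simplified list-view over a shared child buffer of `char`s.
struct ListView {
    offsets: Vec<usize>,
    sizes: Vec<usize>,
}

/// Select rows by index. Only the (offset, size) pairs are gathered;
/// the child values are never read or copied.
fn take_views(view: &ListView, indices: &[usize]) -> ListView {
    ListView {
        offsets: indices.iter().map(|&i| view.offsets[i]).collect(),
        sizes: indices.iter().map(|&i| view.sizes[i]).collect(),
    }
}

/// Materialize each row as a String, for inspection only.
fn resolve(values: &[char], view: &ListView) -> Vec<String> {
    view.offsets
        .iter()
        .zip(&view.sizes)
        .map(|(&o, &s)| values[o..o + s].iter().collect::<String>())
        .collect()
}

fn main() {
    let values = ['A', 'B', 'C', 'D', 'E', 'F'];
    // Rows: "ABC", "D", "EF"
    let view = ListView { offsets: vec![0, 3, 4], sizes: vec![3, 1, 2] };

    // Row 0 is taken twice, so two output rows share the same child range.
    let taken = take_views(&view, &[2, 0, 0]);
    assert_eq!(taken.offsets, vec![4, 0, 0]);
    assert_eq!(taken.sizes, vec![2, 3, 3]);
    assert_eq!(resolve(&values, &taken), vec!["EF", "ABC", "ABC"]);
}
```

With the list layout the same selection would generally require rewriting the child values, since each row's extent is implied by its neighbours' offsets.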

///
/// # Representation

Review comment (Contributor): 😍

///
/// Given the same example array from [`GenericListArray`], it would be represented
/// as such via a list-view layout array:
///
/// ```text
/// ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
/// ┌ ─ ─ ─ ─ ─ ─ ┐ │
/// ┌─────────────┐ ┌───────┐ │ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
/// │ [A,B,C] │ │ (0,3) │ │ 1 │ │ 0 │ │ 3 │ │ │ 1 │ │ A │ │ 0 │
/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ [] │ │ (3,0) │ │ 1 │ │ 3 │ │ 0 │ │ │ 1 │ │ B │ │ 1 │
/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ NULL │ │ (?,?) │ │ 0 │ │ ? │ │ ? │ │ │ 1 │ │ C │ │ 2 │
/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ [D] │ │ (4,1) │ │ 1 │ │ 4 │ │ 1 │ │ │ ? │ │ ? │ │ 3 │

Review comment (Contributor): Technically you don't need a value at index 3, list view even allows for overlapping ranges

Reply (Contributor Author): I've added another example which shows this in use 👍

/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ [NULL, F] │ │ (5,2) │ │ 1 │ │ 5 │ │ 2 │ │ │ 1 │ │ D │ │ 4 │
/// └─────────────┘ └───────┘ │ └───┘ └───┘ └───┘ ├───┤ ├───┤
/// │ │ 0 │ │ ? │ │ 5 │
/// Logical Logical │ Validity Offsets Sizes ├───┤ ├───┤
/// Values Offset (nulls) │ │ 1 │ │ F │ │ 6 │
/// & Size │ └───┘ └───┘
/// │ Values │ │
/// (offsets[i], │ ListViewArray (Array)
/// sizes[i]) └ ─ ─ ─ ─ ─ ─ ┘ │
/// └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
/// ```
///
/// Another way of representing the same array but taking advantage of the offsets being out of order:
///
/// ```text
/// ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
/// ┌ ─ ─ ─ ─ ─ ─ ┐ │
/// ┌─────────────┐ ┌───────┐ │ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
/// │ [A,B,C] │ │ (2,3) │ │ 1 │ │ 2 │ │ 3 │ │ │ 0 │ │ ? │ │ 0 │
/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ [] │ │ (0,0) │ │ 1 │ │ 0 │ │ 0 │ │ │ 1 │ │ F │ │ 1 │
/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ NULL │ │ (?,?) │ │ 0 │ │ ? │ │ ? │ │ │ 1 │ │ A │ │ 2 │
/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ [D] │ │ (5,1) │ │ 1 │ │ 5 │ │ 1 │ │ │ 1 │ │ B │ │ 3 │
/// ├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
/// │ [NULL, F] │ │ (0,2) │ │ 1 │ │ 0 │ │ 2 │ │ │ 1 │ │ C │ │ 4 │
/// └─────────────┘ └───────┘ │ └───┘ └───┘ └───┘ ├───┤ ├───┤
/// │ │ 1 │ │ D │ │ 5 │
/// Logical Logical │ Validity Offsets Sizes └───┘ └───┘
/// Values Offset (nulls) │ Values │ │
/// & Size │ (Array)
/// └ ─ ─ ─ ─ ─ ─ ┘ │
/// (offsets[i], │ ListViewArray
/// sizes[i]) │
/// └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
/// ```
///
/// [`GenericListArray`]: crate::array::GenericListArray
/// [variable length lists]: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
/// [list layout]: https://arrow.apache.org/docs/format/Columnar.html#list-layout
/// [list-view layout]: https://arrow.apache.org/docs/format/Columnar.html#listview-layout
#[derive(Clone)]
pub struct GenericListViewArray<OffsetSize: OffsetSizeTrait> {
data_type: DataType,
nulls: Option<NullBuffer>,
values: ArrayRef,
// Unlike GenericListArray, we do not use OffsetBuffer here as offsets are not
// guaranteed to be monotonically increasing.
value_offsets: ScalarBuffer<OffsetSize>,
value_sizes: ScalarBuffer<OffsetSize>,
}
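
As a sanity check on the two diagrams in the doc comment above, the following plain-Rust sketch decodes both physical representations and confirms that they describe the same logical array. `ListViewModel`, `decode`, `in_order` and `out_of_order` are hypothetical names, and the `?` cells of the diagrams are filled with arbitrary placeholders (`0` or `None`), on the assumption that their contents do not matter.

```rust
/// Hypothetical, simplified model of a list-view array of nullable `char`s.
struct ListViewModel {
    validity: Vec<bool>,
    offsets: Vec<usize>,
    sizes: Vec<usize>,
    values: Vec<Option<char>>,
}

impl ListViewModel {
    /// Decode into logical rows; `None` marks a null list.
    fn decode(&self) -> Vec<Option<Vec<Option<char>>>> {
        (0..self.validity.len())
            .map(|i| {
                if !self.validity[i] {
                    return None; // offset and size of a null row are ignored
                }
                let start = self.offsets[i];
                Some(self.values[start..start + self.sizes[i]].to_vec())
            })
            .collect()
    }
}

/// First diagram: child values stored in logical order.
fn in_order() -> ListViewModel {
    ListViewModel {
        validity: vec![true, true, false, true, true],
        offsets: vec![0, 3, 0, 4, 5], // index 2 is a placeholder
        sizes: vec![3, 0, 0, 1, 2],   // index 2 is a placeholder
        values: vec![Some('A'), Some('B'), Some('C'), None, Some('D'), None, Some('F')],
    }
}

/// Second diagram: same logical rows, child values stored out of order.
fn out_of_order() -> ListViewModel {
    ListViewModel {
        validity: vec![true, true, false, true, true],
        offsets: vec![2, 0, 0, 5, 0], // index 2 is a placeholder
        sizes: vec![3, 0, 0, 1, 2],   // index 2 is a placeholder
        values: vec![None, Some('F'), Some('A'), Some('B'), Some('C'), Some('D')],
    }
}

fn main() {
    let a = in_order().decode();
    let b = out_of_order().decode();
    assert_eq!(a, b);
    println!("{:?}", a);
}
```

Note that the second representation needs only six child slots, because no placeholder slot is reserved for the null row.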
25 changes: 16 additions & 9 deletions arrow-schema/src/datatype.rs
@@ -245,14 +245,15 @@ pub enum DataType {
///
/// # Recommendation
///
-/// Users should prefer [`DataType::Date32`] to cleanly represent the number
/// Users should prefer [`Date32`] to cleanly represent the number
/// of days, or one of the Timestamp variants to include time as part of the
/// representation, depending on their use case.
///
/// # Further Reading
///
/// For more details, see [#5288](https://github.com/apache/arrow-rs/issues/5288).
///
/// [`Date32`]: Self::Date32
/// [Schema.fbs]: https://github.com/apache/arrow/blob/main/format/Schema.fbs
Date64,
/// A signed 32-bit time representing the elapsed time since midnight in the unit of `TimeUnit`.
@@ -282,10 +283,12 @@ pub enum DataType {
LargeBinary,
/// Opaque binary data of variable length.
///
-/// Logically the same as [`Self::Binary`], but the internal representation uses a view
/// Logically the same as [`Binary`], but the internal representation uses a view
/// struct that contains the string length and either the string's entire data
/// inline (for small strings) or an inlined prefix, an index of another buffer,
/// and an offset pointing to a slice in that buffer (for non-small strings).
///
/// [`Binary`]: Self::Binary
BinaryView,
/// A variable-length string in Unicode with UTF-8 encoding.
///
@@ -299,10 +302,12 @@ pub enum DataType {
LargeUtf8,
/// A variable-length string in Unicode with UTF-8 encoding
///
-/// Logically the same as [`Self::Utf8`], but the internal representation uses a view
/// Logically the same as [`Utf8`], but the internal representation uses a view
/// struct that contains the string length and either the string's entire data
/// inline (for small strings) or an inlined prefix, an index of another buffer,
/// and an offset pointing to a slice in that buffer (for non-small strings).
///
/// [`Utf8`]: Self::Utf8
Utf8View,
/// A list of some logical data type with variable length.
///
@@ -311,11 +316,12 @@ pub enum DataType {

/// (NOT YET FULLY SUPPORTED) A list of some logical data type with variable length.
///
/// Logically the same as [`List`], but the internal representation differs in how child
/// data is referenced, allowing flexibility in how data is laid out.
///
/// Note this data type is not yet fully supported. Using it with arrow APIs may result in `panic`s.
///
-/// The ListView layout is defined by three buffers:
-/// a validity bitmap, an offsets buffer, and an additional sizes buffer.
-/// Sizes and offsets are both 32 bits for this type
/// [`List`]: Self::List
ListView(FieldRef),
/// A list of some logical data type with fixed length.
FixedSizeList(FieldRef, i32),
@@ -326,11 +332,12 @@ pub enum DataType {

/// (NOT YET FULLY SUPPORTED) A list of some logical data type with variable length and 64-bit offsets.
///
/// Logically the same as [`LargeList`], but the internal representation differs in how child
/// data is referenced, allowing flexibility in how data is laid out.
///
/// Note this data type is not yet fully supported. Using it with arrow APIs may result in `panic`s.
///
-/// The LargeListView layout is defined by three buffers:
-/// a validity bitmap, an offsets buffer, and an additional sizes buffer.
-/// Sizes and offsets are both 64 bits for this type
/// [`LargeList`]: Self::LargeList
LargeListView(FieldRef),
/// A nested datatype that contains a number of sub-fields.
Struct(Fields),