
Commit 3eb4acf

sm4rtm4art (Martin) authored and alamb committed
"Gentle Introduction to Arrow / Record Batches" apache#11336 (apache#18051)

## Which issue does this PR close?

- Closes apache#11336

Since this is my first contribution, I suppose I should mention @alamb, author of issue apache#11336. Could you please trigger the CI? Thanks!

## Rationale for this change

The Arrow introduction guide (apache#11336) needed improvements to make it more accessible for newcomers while providing better navigation to advanced topics.

## What changes are included in this PR?

Issue apache#11336 requested a gentle introduction to Apache Arrow and RecordBatches to help DataFusion users understand the foundational concepts. This PR enhances the existing Arrow introduction guide with clearer explanations, practical examples, visual aids, and comprehensive navigation links, making it more accessible for newcomers while providing pathways to advanced topics.

I was unsure whether this fits in `docs/source/user-guide/dataframe.md`.

## Are these changes tested?

Applied prettier, as described.

## Are there any user-facing changes?

Yes - improved documentation for the Arrow introduction guide at `docs/source/user-guide/arrow-introduction.md`

---------

Co-authored-by: Martin <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

1 parent 47916f2 commit 3eb4acf

File tree

4 files changed (+288 −1 lines changed)

- datafusion/core/src/lib.rs
- docs/source/index.rst
- docs/source/user-guide/arrow-introduction.md
- docs/source/user-guide/dataframe.md

datafusion/core/src/lib.rs

Lines changed: 30 additions & 1 deletion

@@ -443,7 +443,30 @@
 //! other operators read a single [`RecordBatch`] from their input to produce a
 //! single [`RecordBatch`] as output.
 //!
-//! For example, given this SQL query:
+//! For example, given this SQL:
+//!
+//! ```sql
+//! SELECT name FROM 'data.parquet' WHERE id > 10
+//! ```
+//!
+//! A simplified DataFusion execution plan is shown below. It first reads
+//! data from the Parquet file, then applies the filter, then the projection,
+//! and finally produces output. Each step processes one [`RecordBatch`] at a
+//! time. Multiple batches are processed concurrently on different CPU cores
+//! for plans with multiple partitions.
+//!
+//! ```text
+//! ┌─────────────┐    ┌──────────────┐    ┌────────────────┐    ┌──────────────────┐    ┌──────────┐
+//! │   Parquet   │───▶│  DataSource  │───▶│   FilterExec   │───▶│  ProjectionExec  │───▶│ Results  │
+//! │    File     │    │              │    │                │    │                  │    │          │
+//! └─────────────┘    └──────────────┘    └────────────────┘    └──────────────────┘    └──────────┘
+//!                     (reads data)         (id > 10)            (keeps "name" col)
+//!                     RecordBatch ───▶    RecordBatch ────▶    RecordBatch ────▶       RecordBatch
+//! ```
+//!
+//! DataFusion uses the classic "pull" based control flow (explained more in the
+//! next section) to implement streaming execution. As an example,
+//! consider the following SQL query:
 //!
 //! ```sql
 //! SELECT date_trunc('month', time) FROM data WHERE id IN (10,20,30);
@@ -897,6 +920,12 @@ doc_comment::doctest!("../../../README.md", readme_example_test);
 // For example, if `user_guide_expressions(line 123)` fails,
 // go to `docs/source/user-guide/expressions.md` to find the relevant problem.
 //
+#[cfg(doctest)]
+doc_comment::doctest!(
+    "../../../docs/source/user-guide/arrow-introduction.md",
+    user_guide_arrow_introduction
+);
+
 #[cfg(doctest)]
 doc_comment::doctest!(
     "../../../docs/source/user-guide/concepts-readings-events.md",

docs/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -118,6 +118,7 @@ To get started, see
    user-guide/crate-configuration
    user-guide/cli/index
    user-guide/dataframe
+   user-guide/arrow-introduction
    user-guide/expressions
    user-guide/sql/index
    user-guide/configs
docs/source/user-guide/arrow-introduction.md

Lines changed: 255 additions & 0 deletions

@@ -0,0 +1,255 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Gentle Arrow Introduction

```{contents}
:local:
:depth: 2
```

## Overview

DataFusion uses [Apache Arrow] as its native in-memory format, so anyone using DataFusion will likely interact with Arrow at some point. This guide introduces the key Arrow concepts you need to know to use DataFusion effectively.

Apache Arrow defines a standardized columnar representation for in-memory data. This enables different systems and languages (e.g., Rust and Python) to share data with zero-copy interchange, avoiding serialization overhead. Beyond zero-copy interchange, Arrow also standardizes a best-practice columnar data representation, enabling high-performance analytical processing through vectorized execution.

## Columnar Layout

Quick visual: row-major (left) vs Arrow's columnar layout (right). For a deeper primer, see the [arrow2 guide].

```text
Traditional Row Storage:        Arrow Columnar Storage:
┌────┬──────┬──────┐            ┌─────────┬─────────┬──────────┐
│ id │ name │ age  │            │   id    │  name   │   age    │
├────┼──────┼──────┤            ├─────────┼─────────┼──────────┤
│ 1  │ A    │ 30   │            │ [1,2,3] │ [A,B,C] │[30,25,35]│
│ 2  │ B    │ 25   │            └─────────┴─────────┴──────────┘
│ 3  │ C    │ 35   │                 ↑          ↑         ↑
└────┴──────┴──────┘            Int32Array StringArray Int32Array
(read entire rows)              (process entire columns at once)
```

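The columnar layout is what makes vectorized execution possible: a compute kernel can process an entire column in one call instead of walking rows one by one. Here is a minimal sketch using an arrow-rs aggregate kernel (the values mirror the `age` column above and are purely illustrative):

```rust
use arrow::array::Int32Array;
use arrow::compute::sum;

// One contiguous Arrow column, as in the diagram above
let ages = Int32Array::from(vec![30, 25, 35]);

// A single vectorized kernel call scans the whole column at once
let total = sum(&ages);
assert_eq!(total, Some(90));
```
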
## `RecordBatch`

Arrow's standard unit for packaging data is the **[`RecordBatch`]**.

A **[`RecordBatch`]** represents a horizontal slice of a table: a collection of equal-length columnar arrays that conform to a defined schema. Each column within the slice is a contiguous Arrow array, and all columns have the same number of rows (length). This chunked, immutable unit enables efficient streaming and parallel execution.

Think of it as having two perspectives:

- **Columnar inside**: Each column (`id`, `name`, `age`) is a contiguous array optimized for vectorized operations
- **Row-chunked outside**: The batch represents a chunk of rows (e.g., rows 1-1000), making it a manageable unit for streaming

RecordBatches are **immutable snapshots**: once created, they cannot be modified. Any transformation produces a _new_ RecordBatch, enabling safe parallel processing without locks or coordination overhead.

This design allows DataFusion to process streams of row-based chunks while gaining maximum performance from the columnar layout.

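Because batches are immutable, "modifying" one means building a new one; conversely, cheap zero-copy views are easy. The sketch below (reusing the illustrative columns from the diagram above) shows that slicing a batch shares the underlying buffers instead of copying them:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, RecordBatch, StringArray};

let ids: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
let names: ArrayRef = Arc::new(StringArray::from(vec!["A", "B", "C"]));
let batch = RecordBatch::try_from_iter(vec![("id", ids), ("name", names)]).unwrap();

// Zero-copy: the slice shares the original column buffers
let last_two = batch.slice(1, 2);
assert_eq!(last_two.num_rows(), 2);
assert_eq!(batch.num_rows(), 3); // the original batch is untouched
```
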
## Streaming Through the Engine

DataFusion processes queries as pull-based pipelines where operators request batches from their inputs. This streaming approach enables early result production, bounds memory usage (spilling to disk only when necessary), and naturally supports parallel execution across multiple CPU cores.

For example, given the following query:

```sql
SELECT name FROM 'data.parquet' WHERE id > 10
```

The DataFusion pipeline looks like this:

```text
┌─────────────┐    ┌──────────────┐    ┌────────────────┐    ┌──────────────────┐    ┌──────────┐
│   Parquet   │───▶│     Scan     │───▶│     Filter     │───▶│    Projection    │───▶│ Results  │
│    File     │    │   Operator   │    │    Operator    │    │     Operator     │    │          │
└─────────────┘    └──────────────┘    └────────────────┘    └──────────────────┘    └──────────┘
                    (reads data)         (id > 10)            (keeps "name" col)
                    RecordBatch ───▶    RecordBatch ────▶    RecordBatch ────▶       RecordBatch
```

In this pipeline, [`RecordBatch`]es are the "packages" of columnar data that flow between the different stages of query execution. Each operator processes batches incrementally, enabling the system to produce results before reading the entire input.

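To make the pull-based flow concrete, here is a sketch of driving such a pipeline from the [`DataFrame`] API. It assumes a local `data.parquet` file with `id` and `name` columns exists, and that the `tokio` and `futures` crates are available:

```rust
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Assumes a local file `data.parquet` with `id` and `name` columns
    ctx.register_parquet("data", "data.parquet", ParquetReadOptions::default())
        .await?;

    let df = ctx.sql("SELECT name FROM data WHERE id > 10").await?;

    // Each `next().await` pulls one RecordBatch through the pipeline above
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        println!("received a batch with {} rows", batch?.num_rows());
    }
    Ok(())
}
```
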
## Creating `ArrayRef` and `RecordBatch`es

Sometimes you need to create Arrow data programmatically rather than reading from files.

The first step is creating an Arrow array for each column. [arrow-rs] provides array builders and `From` impls to create arrays from Rust vectors.

```rust
use arrow::array::{StringArray, Int32Array};
// Create an Int32Array from a vector of i32 values
let ids = Int32Array::from(vec![1, 2, 3]);
// There are similar constructors for other array types, e.g., StringArray, Float64Array, etc.
let names = StringArray::from(vec![Some("alice"), None, Some("carol")]);
```

Every element in an Arrow array can be "null" (aka missing). Often, arrays are
created from `Option<T>` values to indicate nullability (e.g., `Some("alice")`
vs `None` above).

Note: you'll see [`Arc`] used frequently in the code. Arrow arrays are wrapped in
[`Arc`] (atomically reference-counted pointers) to enable cheap, thread-safe
sharing across operators and tasks. [`ArrayRef`] is simply a type alias for
`Arc<dyn Array>`. To create an `ArrayRef`, wrap your array in `Arc::new(...)` as shown below.

```rust
use std::sync::Arc;
# use arrow::array::{ArrayRef, Int32Array, StringArray};
// To get an ArrayRef, wrap the Int32Array in an Arc.
// (note you will often have to explicitly type annotate to ArrayRef)
let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

// You can also store Strings and other types in ArrayRefs
let arr: ArrayRef = Arc::new(
    StringArray::from(vec![Some("alice"), None, Some("carol")])
);
```

To create a [`RecordBatch`], you need to define its [`Schema`] (the column names and types) and provide the corresponding columns as [`ArrayRef`]s, as shown below:

```rust
# use std::sync::Arc;
# use arrow_schema::ArrowError;
# use arrow::array::{ArrayRef, Int32Array, StringArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

// Create the columns as Arrow arrays
let ids = Int32Array::from(vec![1, 2, 3]);
let names = StringArray::from(vec![Some("alice"), None, Some("carol")]);
// Create the schema
let schema = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int32, false), // false means non-nullable
    Field::new("name", DataType::Utf8, true), // true means nullable
]));
// Assemble the columns
let cols: Vec<ArrayRef> = vec![
    Arc::new(ids),
    Arc::new(names),
];
// Finally, create the RecordBatch
RecordBatch::try_new(schema, cols).expect("Failed to create RecordBatch");
```

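Once you have a `RecordBatch`, you can hand it straight to DataFusion. One way is to wrap it in a [`MemTable`] and register it with a [`SessionContext`]; the sketch below (the table name `people` is illustrative) assumes the `tokio` crate is available:

```rust
use std::sync::Arc;
use datafusion::arrow::array::{ArrayRef, Int32Array, RecordBatch};
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ids: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_from_iter(vec![("id", ids)])?;

    // A MemTable serves in-memory RecordBatches as a queryable table
    let table = MemTable::try_new(batch.schema(), vec![vec![batch]])?;

    let ctx = SessionContext::new();
    ctx.register_table("people", Arc::new(table))?;
    ctx.sql("SELECT id FROM people WHERE id > 1").await?.show().await?;
    Ok(())
}
```
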
## Working with `ArrayRef` and `RecordBatch`

Most DataFusion APIs are expressed in terms of [`ArrayRef`] and [`RecordBatch`]. To work with the
underlying data, you typically downcast the [`ArrayRef`] to its concrete type
(e.g., [`Int32Array`]).

To do so, either use the `as_any().downcast_ref::<T>()` method or one of the
typed helpers (e.g., `as_primitive::<T>()`) from the [AsArray] trait.

[asarray]: https://docs.rs/arrow-array/latest/arrow_array/cast/trait.AsArray.html

```rust
# use std::sync::Arc;
# use arrow::datatypes::{DataType, Int32Type};
# use arrow::array::{AsArray, ArrayRef, Int32Array, RecordBatch};
# let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
// First check the data type of the array
match arr.data_type() {
    &DataType::Int32 => {
        // Downcast to Int32Array
        let int_array = arr.as_primitive::<Int32Type>();
        // Now you can access Int32Array methods
        for i in 0..int_array.len() {
            println!("Value at index {}: {}", i, int_array.value(i));
        }
    }
    _ => {
        println!("Array is not of type Int32");
    }
}
```

The following two downcasting methods are equivalent:

```rust
# use std::sync::Arc;
# use arrow::datatypes::{DataType, Int32Type};
# use arrow::array::{AsArray, ArrayRef, Int32Array, RecordBatch};
# let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
// Downcast to Int32Array using as_any
let int_array1 = arr.as_any().downcast_ref::<Int32Array>().unwrap();
// This is the same as using the as_primitive::<T>() helper
let int_array2 = arr.as_primitive::<Int32Type>();
assert_eq!(int_array1, int_array2);
```

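The same pattern applies to columns inside a `RecordBatch`: pull the column out as an [`ArrayRef`], then downcast. A short sketch (reusing the hypothetical `id` column from earlier):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, AsArray, Int32Array, RecordBatch};
use arrow::datatypes::Int32Type;

let ids: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
let batch = RecordBatch::try_from_iter(vec![("id", ids)]).unwrap();

// Columns come back as ArrayRef; downcast to read concrete values
let id_col = batch.column_by_name("id").expect("column exists");
assert_eq!(id_col.as_primitive::<Int32Type>().value(0), 1);
```
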
## Common Pitfalls

When working with Arrow and RecordBatches, watch out for these common issues:

- **Schema consistency**: All batches in a stream must share the exact same [`Schema`]. For example, you can't have one batch where a column is [`Int32`] and the next where it's [`Int64`], even if the values would fit
- **Immutability**: Arrays are immutable; to "modify" data, you must build new arrays or new RecordBatches. For instance, to change a value in an array, you'd create a new array with the updated value
- **Row-by-row processing**: Avoid iterating over arrays element by element when possible, and use Arrow's built-in [compute kernels] instead
- **Type mismatches**: Mixed input types across files may require explicit casts. For example, a string column `"123"` from a CSV file won't automatically join with an integer column `123` from a Parquet file; you'll need to cast one to match the other. Use Arrow's [`cast`] kernel where appropriate (see the sketch after this list)
- **Batch size assumptions**: Don't assume a particular batch size; always iterate until the stream ends. One file might produce 8192-row batches while another produces 1024-row batches

[compute kernels]: https://docs.rs/arrow/latest/arrow/compute/index.html

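Here is a minimal sketch of the [`cast`] kernel fixing such a type mismatch (the string values are illustrative, standing in for CSV-sourced data):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

// A string column, as commonly produced by CSV readers
let strings: ArrayRef = Arc::new(StringArray::from(vec!["1", "2", "3"]));

// Cast to Int64 so it can be compared or joined with integer columns
let ints = cast(&strings, &DataType::Int64).expect("cast failed");
assert_eq!(ints.data_type(), &DataType::Int64);
```
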
## Further reading

**Arrow Documentation:**

- [Arrow Format Introduction](https://arrow.apache.org/docs/format/Intro.html) - Understand the Arrow specification and why it enables zero-copy data sharing
- [Arrow Columnar Format](https://arrow.apache.org/docs/format/Columnar.html) - Deep dive into the memory layout for performance optimization
- [Arrow Rust Documentation](https://docs.rs/arrow/latest/arrow/) - Complete API reference for the Rust implementation

**Key API References:**

- [RecordBatch](https://docs.rs/arrow-array/latest/arrow_array/struct.RecordBatch.html) - The fundamental data structure for columnar data (a table slice)
- [ArrayRef](https://docs.rs/arrow-array/latest/arrow_array/array/type.ArrayRef.html) - A reference-counted Arrow array (single column)
- [DataType](https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html) - Enum of all supported Arrow data types (e.g., Int32, Utf8)
- [Schema](https://docs.rs/arrow-schema/latest/arrow_schema/struct.Schema.html) - Describes the structure of a RecordBatch (column names and types)

[apache arrow]: https://arrow.apache.org/docs/index.html
[arrow-rs]: https://github.com/apache/arrow-rs
[`arc`]: https://doc.rust-lang.org/std/sync/struct.Arc.html
[`arrayref`]: https://docs.rs/arrow-array/latest/arrow_array/array/type.ArrayRef.html
[`cast`]: https://docs.rs/arrow/latest/arrow/compute/fn.cast.html
[`field`]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Field.html
[`schema`]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Schema.html
[`datatype`]: https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html
[`int32array`]: https://docs.rs/arrow-array/latest/arrow_array/array/struct.Int32Array.html
[`stringarray`]: https://docs.rs/arrow-array/latest/arrow_array/array/struct.StringArray.html
[`int32`]: https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html#variant.Int32
[`int64`]: https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html#variant.Int64
[extension points]: ../library-user-guide/extensions.md
[`tableprovider`]: https://docs.rs/datafusion/latest/datafusion/datasource/trait.TableProvider.html
[custom table providers guide]: ../library-user-guide/custom-table-providers.md
[user-defined functions (udfs)]: ../library-user-guide/functions/adding-udfs.md
[custom optimizer rules and physical operators]: ../library-user-guide/extending-operators.md
[`executionplan`]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
[`.register_table()`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.register_table
[`.sql()`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql
[`.show()`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.show
[`memtable`]: https://docs.rs/datafusion/latest/datafusion/datasource/struct.MemTable.html
[`sessioncontext`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html
[`csvreadoptions`]: https://docs.rs/datafusion/latest/datafusion/execution/options/struct.CsvReadOptions.html
[`parquetreadoptions`]: https://docs.rs/datafusion/latest/datafusion/execution/options/struct.ParquetReadOptions.html
[`recordbatch`]: https://docs.rs/arrow-array/latest/arrow_array/struct.RecordBatch.html
[`read_csv`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_csv
[`read_parquet`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_parquet
[`read_json`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_json
[`read_avro`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_avro
[`dataframe`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html
[`.collect()`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect
[arrow2 guide]: https://jorgecarleitao.github.io/arrow2/main/guide/arrow.html#what-is-apache-arrow
[configuration settings]: configs.md
[`datafusion.execution.batch_size`]: configs.md#setting-configuration-options

docs/source/user-guide/dataframe.md

Lines changed: 2 additions & 0 deletions

@@ -19,6 +19,8 @@

 # DataFrame API

+## DataFrame overview
+
 A DataFrame represents a logical set of rows with the same named columns,
 similar to a [Pandas DataFrame] or [Spark DataFrame].
