Implement a Vec<RecordBatch> wrapper for pyarrow.Table convenience
#8790
Merged

alamb merged 1 commit into apache:main from jonded94:implement-pyarrow-table-convenience-class on Dec 5, 2025
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
//! have these same concepts. A chunked table is instead represented with
//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
//! and then importing the reader as a [ArrowArrayStreamReader].
//! (2) Although arrow-rs offers [Table], a convenience wrapper for [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
//! that internally holds `Vec<RecordBatch>`, it is meant primarily for use cases where you already
//! have `Vec<RecordBatch>` on the Rust side and want to export that in bulk as a `pyarrow.Table`.
//! In general, it is recommended to use streaming approaches instead of dealing with data in bulk.
//! For example, a `pyarrow.Table` (or any other object that implements the ArrayStream PyCapsule
//! interface) can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>` instead of
//! forcing eager reading into `Vec<RecordBatch>`.

use std::convert::{From, TryFrom};
use std::ptr::{addr_of, addr_of_mut};
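For the streaming approach recommended in the doc comment above, a minimal sketch (not part of this diff; the function name `total_rows` and the crate paths are illustrative and assume the `arrow` crate with its `pyarrow` feature enabled) might look like:

```rust
use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow::pyarrow::PyArrowType;
use pyo3::prelude::*;

/// Illustrative only: count rows from any ArrowArrayStream-compatible object
/// (a `pyarrow.Table`, a `pyarrow.RecordBatchReader`, ...) one batch at a time,
/// without collecting everything into a `Vec<RecordBatch>` first.
#[pyfunction]
fn total_rows(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    let mut rows = 0usize;
    // ArrowArrayStreamReader is an iterator of Result<RecordBatch, ArrowError>.
    for batch in reader.0 {
        let batch = batch
            .map_err(|err| pyo3::exceptions::PyValueError::new_err(err.to_string()))?;
        rows += batch.num_rows();
    }
    Ok(rows)
}
```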
@@ -68,13 +71,13 @@ use arrow_array::{
    make_array,
};
use arrow_data::ArrayData;
use arrow_schema::{ArrowError, DataType, Field, Schema};
use arrow_schema::{ArrowError, DataType, Field, Schema, SchemaRef};
use pyo3::exceptions::{PyTypeError, PyValueError};
use pyo3::ffi::Py_uintptr_t;
use pyo3::import_exception;
use pyo3::prelude::*;
use pyo3::pybacked::PyBackedStr;
use pyo3::types::{PyCapsule, PyList, PyTuple};
use pyo3::types::{PyCapsule, PyDict, PyList, PyTuple};
use pyo3::{import_exception, intern};

import_exception!(pyarrow, ArrowException);
/// Represents an exception raised by PyArrow.
@@ -484,6 +487,100 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}

/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from

(Contributor comment on this line: "I am not familiar enough with how the python interface works to know if this is reasonable or not. Perhaps @kylebarron can help review this part")

/// and to `pyarrow.Table`.
///
/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
/// side.
///
/// ```ignore
/// #[pyfunction]
/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
///     let batches: Vec<RecordBatch>;
///     let schema: SchemaRef;
///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
/// }
/// ```
#[derive(Clone)]
pub struct Table {
    record_batches: Vec<RecordBatch>,
    schema: SchemaRef,
}

impl Table {
    pub fn try_new(
        record_batches: Vec<RecordBatch>,
        schema: SchemaRef,
    ) -> Result<Self, ArrowError> {
        for record_batch in &record_batches {
            if schema != record_batch.schema() {
                return Err(ArrowError::SchemaError(format!(
                    "All record batches must have the same schema. \
                    Expected schema: {:?}, got schema: {:?}",
                    schema,
                    record_batch.schema()
                )));
            }
        }
        Ok(Self {
            record_batches,
            schema,
        })
    }

    pub fn record_batches(&self) -> &[RecordBatch] {
        &self.record_batches
    }

    pub fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }

    pub fn into_inner(self) -> (Vec<RecordBatch>, SchemaRef) {
        (self.record_batches, self.schema)
    }
}
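On the Rust side, constructing the wrapper could look roughly like the following sketch (illustrative only; `build_table` is a made-up helper, it assumes a single `Int32` column, and `Table` refers to the struct added above):

```rust
use std::sync::Arc;

use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

// Hypothetical helper: build two batches that share one schema and wrap them.
// Table::try_new checks every batch against the schema and returns an
// ArrowError::SchemaError on a mismatch.
fn build_table() -> Result<Table, ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    Table::try_new(vec![batch.clone(), batch], schema)
}
```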
|
|
||
| impl TryFrom<Box<dyn RecordBatchReader>> for Table { | ||
| type Error = ArrowError; | ||
|
|
||
| fn try_from(value: Box<dyn RecordBatchReader>) -> Result<Self, ArrowError> { | ||
| let schema = value.schema(); | ||
| let batches = value.collect::<Result<Vec<_>, _>>()?; | ||
| Self::try_new(batches, schema) | ||
| } | ||
| } | ||
|
|
||
| /// Convert a `pyarrow.Table` (or any other ArrowArrayStream compliant object) into [`Table`] | ||
| impl FromPyArrow for Table { | ||
| fn from_pyarrow_bound(ob: &Bound<PyAny>) -> PyResult<Self> { | ||
| let reader: Box<dyn RecordBatchReader> = | ||
| Box::new(ArrowArrayStreamReader::from_pyarrow_bound(ob)?); | ||
| Self::try_from(reader).map_err(|err| PyErr::new::<PyValueError, _>(err.to_string())) | ||
| } | ||
| } | ||
|
|
||
| /// Convert a [`Table`] into `pyarrow.Table`. | ||
| impl IntoPyArrow for Table { | ||
| fn into_pyarrow(self, py: Python) -> PyResult<Bound<PyAny>> { | ||
| let module = py.import(intern!(py, "pyarrow"))?; | ||
| let class = module.getattr(intern!(py, "Table"))?; | ||
|
|
||
| let py_batches = PyList::new(py, self.record_batches.into_iter().map(PyArrowType))?; | ||
| let py_schema = PyArrowType(Arc::unwrap_or_clone(self.schema)); | ||
|
|
||
| let kwargs = PyDict::new(py); | ||
| kwargs.set_item("schema", py_schema)?; | ||
|
|
||
| let reader = class.call_method("from_batches", (py_batches,), Some(&kwargs))?; | ||
|
|
||
| Ok(reader) | ||
| } | ||
| } | ||
|
|
||
| /// A newtype wrapper for types implementing [`FromPyArrow`] or [`IntoPyArrow`]. | ||
| /// | ||
| /// When wrapped around a type `T: FromPyArrow`, it | ||
|
|
||
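Putting the two impls above together, a `#[pyfunction]` can both accept and return a `pyarrow.Table` through `PyArrowType<Table>`. A sketch (the function name and the `arrow::pyarrow::Table` import path are assumptions for illustration, not taken from this diff):

```rust
use arrow::pyarrow::{PyArrowType, Table};
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Illustrative round trip: take a `pyarrow.Table`, keep only its first
/// record batch, and return the result to Python as a new `pyarrow.Table`.
#[pyfunction]
fn first_batch_only(table: PyArrowType<Table>) -> PyResult<PyArrowType<Table>> {
    // FromPyArrow has already run: `table.0` holds the batches and the schema.
    let (batches, schema) = table.0.into_inner();
    let first: Vec<_> = batches.into_iter().take(1).collect();
    let out = Table::try_new(first, schema)
        .map_err(|err| PyValueError::new_err(err.to_string()))?;
    // On return, IntoPyArrow calls pyarrow.Table.from_batches(..., schema=...).
    Ok(PyArrowType(out))
}
```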
Review comment:
I don't think you actually have to subclass from the protocol; the type checker will automatically check for structural type equality.
Reply:
One doesn't have to subtype a typing.Protocol, yes; that is the whole idea behind it (structural rather than nominal typing), which allows consuming external objects that don't inherit directly from your classes as long as they conform to a certain pattern. But in cases where you have strong control over your own classes anyway, I find it highly beneficial to inherit directly from the Protocol whenever possible. This moves the detection of type mismatches to the place where the class is defined, instead of requiring you to make sure every class is actually used in business logic that expects objects conforming to the protocol. Also, I have seen over the years that with very intricate Protocols, existing Python type checkers can sometimes catch subtle type errors a bit more reliably when a class inherits directly from the Protocol, though that shouldn't really be relevant here.

Besides that, I sadly don't think there are any type checks actually running in CI 😅 I think the only thing done here is to compile the Python package and run the tests. There should probably be a follow-up PR introducing some type checking with mypy --strict or similar.