Implement a `Vec<RecordBatch>` wrapper for `pyarrow.Table` convenience (#8790)
Conversation
kylebarron left a comment:
Historically, the attitude of this crate has been to avoid "Table" constructs in order to push users towards streaming approaches.
I don't know what the stance of the maintainers is towards including a Table construct for the Python integration.
FWIW, if you wanted to look at external crates, PyTable exists and probably does what you want (disclosure: it's my project). That might alternatively give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons.)
Thanks @kylebarron for your very quick review! ❤️

Yes, I'm also not too sure about it; that's why I just sketched out a rough implementation without tests so far. A reason why I think this potentially could be nice to have in `arrow-rs`: at least I personally think having such a wrapper could be nice, since it simplifies things a bit when you already have a `Vec<RecordBatch>` at hand anyways.

Slightly nicer Python workflow:

In our very specific example, we have a Python class with a function such as this one:

```python
class ParquetFile:
    def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...
```

In the issue I linked, this unfortunately breaks down for a specific parquet file, since a particular row group isn't expressible as a single `RecordBatch`, so one would have to return something like an `Iterator[pyarrow.RecordBatch]` instead. The latter comes with a bit of syntactic shortcomings in contexts where you want to apply `to_pylist()`:

```python
rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]]
if isinstance(rg, pyarrow.RecordBatch):
    python_objs = rg.to_pylist()
else:
    python_objs = list(itertools.chain.from_iterable(batch.to_pylist() for batch in rg))
```

With a `pyarrow.Table` return type instead, this becomes:

```python
rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]] = rg.to_pylist()
```

And just for clarity: we unfortunately need the entire row group deserialized as Python objects, because the data ingestion pipelines that consume this expect access to the entire row group in bulk, so streaming approaches are sadly not usable.
Yes, in general, I much prefer the approach of …
I think that's outdated for Python -> Rust. I haven't tried, but you should be able to pass a `pyarrow.Table` to a function accepting `PyArrowType<ArrowArrayStreamReader>`. But I assume there's no way today to easily return a `pyarrow.Table`.
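For illustration, a minimal sketch of that import direction (assuming the `pyarrow` feature of the `arrow` crate; the function name and error handling are illustrative, not from this PR):

```rust
use arrow::array::RecordBatch;
use arrow::error::ArrowError;
use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow::pyarrow::PyArrowType;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Accepts anything implementing the Arrow PyCapsule stream interface,
/// e.g. a `pyarrow.Table` or a `pyarrow.RecordBatchReader`.
#[pyfunction]
fn consume_table(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    // Materialize the stream into a Vec<RecordBatch>.
    let batches: Vec<RecordBatch> = reader
        .0
        .collect::<Result<Vec<_>, ArrowError>>()
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(batches.iter().map(|b| b.num_rows()).sum())
}
```

Because `pyarrow.Table` implements the ArrowStream PyCapsule interface, such a function accepts a `pyarrow.Table`, a `pyarrow.RecordBatchReader`, or any other stream exporter without special-casing.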
I'm fine with that, and I think other maintainers would probably be fine with it too, since it's only a concept that exists in the Python integration.

I'm not sure I totally get your example. It seems bad to be returning a union of multiple types to Python, but it seems reasonable to return a `pyarrow.Table`.
Well, there's nothing stopping you from materializing the stream on the Python side, e.g. by passing it to `pyarrow.table(...)`, which accepts anything implementing the Arrow PyCapsule stream interface.
Yes, exactly, that's what I even mentioned here in this PR (https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR491-R492 + https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR545-R549): this is even used to convert a `pyarrow.Table` into the Rust-side wrapper. As you said, the opposite, namely easily returning a `pyarrow.Table`, is what's missing today.
My example wasn't entirely complete for simplicity (and still isn't); it would be more something like this:

```python
class ParquetFile:
    @overload
    def read_row_group(self, index: int, as_table: Literal[True]) -> pyarrow.Table: ...
    @overload
    def read_row_group(self, index: int, as_table: Literal[False]) -> pyarrow.RecordBatch: ...
    def read_row_group(self, index: int, as_table: bool = False) -> pyarrow.RecordBatch | pyarrow.Table: ...
```

The advantage of that would be that both return types support `to_pylist()`, so callers could even be typed against a shared protocol:

```python
class ToListCapable(Protocol):
    def to_pylist(self) -> list[dict[str, Any]]: ...

class ParquetFile:
    def read_row_group(self, index: int, as_table: bool = False) -> ToListCapable: ...
```
Yes, sure! We also do that in other places, or have entirely streamable pipelines elsewhere that use the PyCapsule ArrowStream interface. It's just that for this very specific use case, a `pyarrow.Table` is what we need.
I am not a python expert nor have I fully understood all the discussion on this ticket, but this would be my preferred approach -- make it easy to go from Rust <> Python, while trying to encourage good practices (e.g. streaming). There is no reason to be pedantic and force someone through hoops to make a PyTable if that is what they want.
I now added a bunch of tests in `arrow-pyarrow-integration-tests`.
arrow-pyarrow/src/lib.rs (outdated diff):
```rust
//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
```
I think it would be good to note here that another advantage of using `ArrowArrayStreamReader` is that it works with tables and stream input out of the box. It doesn't matter which type the user passes in.
Also, reading through the docs again, I'd suggest making a reference to `Box<dyn RecordBatchReader>` rather than `ArrowArrayStreamReader`. The former is a higher-level API and much easier to use.
> I think it would be good to note here that another advantage of using `ArrowArrayStreamReader` is that it works with tables and stream input out of the box.
I added that in the docs.
> Also, reading through the docs again, I'd suggest making a reference to `Box<dyn RecordBatchReader>` rather than `ArrowArrayStreamReader`. The former is a higher-level API and much easier to use.
I'm not exactly sure what you mean here. `Box<dyn RecordBatchReader>` only implements `IntoPyArrow`, but not `FromPyArrow`. So for the example I state in the new documentation (that consuming a `pyarrow.Table` in Rust could also use a streaming approach), `Box<dyn RecordBatchReader>` sadly isn't helping. One has to use `ArrowArrayStreamReader`, since that properly implements `FromPyArrow`.
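For the opposite (Rust -> Python) direction, a sketch of what does work today (illustrative names, not code from this PR): `Box<dyn RecordBatchReader + Send>` implements `IntoPyArrow`, so a stream can be returned to Python even though it can't be accepted that way:

```rust
use std::sync::Arc;

use arrow::array::{Int32Array, RecordBatch, RecordBatchIterator, RecordBatchReader};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::pyarrow::PyArrowType;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// Export direction: Box<dyn RecordBatchReader + Send> implements IntoPyArrow,
// so the higher-level trait object is fine when *returning* a stream. For the
// *import* direction one has to use ArrowArrayStreamReader instead, since
// only that implements FromPyArrow.
#[pyfunction]
fn make_stream() -> PyResult<PyArrowType<Box<dyn RecordBatchReader + Send>>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )
    .map_err(|e| PyValueError::new_err(e.to_string()))?;
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema);
    Ok(PyArrowType(Box::new(reader)))
}
```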
Hey @kylebarron, thanks for the review! I implemented everything as you suggested. As you can see, now the CI is broken because of a subtle problem that was uncovered. Maybe you can help me, as I'm not too familiar with the FFI logic and all my sanity checks did not help me: the schema of the `RecordBatch`es returned by the stream reader does not match the schema advertised by the reader itself.

EDIT: More importantly, omitting the schema check in …
@kylebarron for now I stole the `schema_equals` logic, see the diff below.
arrow-pyarrow/src/lib.rs (outdated diff):
```rust
for record_batch in &record_batches {
    if !schema_equals(&schema, &record_batch.schema()) {
        return Err(ArrowError::SchemaError(
            //"All record batches must have the same schema.".to_owned(),
```
| //"All record batches must have the same schema.".to_owned(), |
I only have the more verbose error message here right now to understand what's going on in the schema mismatch. The shorter message is currently commented out to signal that this is not intended to be merged as-is; the schema mismatch issue should be understood first.

In general I'm opinionless about how verbose the error message should be; I'd happily remove whatever variant you dislike eventually.
Hey, I pushed a version that actually does not use the PyCapsule ArrayStream interface for converting a `pyarrow.Table`.

With that, I got RecordBatches with preserved metadata from `pyarrow`. I also checked on the Python side: potentially there is a slight misuse of the PyCapsule interface somewhere, as it definitely seems to return RecordBatches without metadata. I'm not too familiar with the low-level stuff there, but I'll try to investigate; help is appreciated!
Okay, I think the error was in `arrow-array/src/ffi_stream.rs`: the `RecordBatch`es built there from `StructArray`s lose the schema metadata. With that fixed, consuming a `pyarrow.Table` through the PyCapsule interface works as well. @alamb @kylebarron let me know whether the change of `ffi_stream.rs` is acceptable.
alamb left a comment:
Thanks @jonded94
arrow-array/src/ffi_stream.rs (outdated diff):
```diff
 };
-Some(result.map(|data| RecordBatch::from(StructArray::from(data))))
+Some(result.map(|data| {
+    let struct_array = StructArray::from(data);
```
Can you please leave a comment here explaining:
- The rationale for this form rather than just converting `StructArray` to `RecordBatch`
- A `SAFETY` comment explaining why the `unsafe` is ok (aka how the invariants required for `RecordBatch::new_unchecked` are satisfied)
> The rationale for this form rather than just converting `StructArray` to `RecordBatch`
Basically what I explained here (#8790 (comment)) => `StructArray` alone by definition is metadata-less, in turn leading to the problem that the resulting `RecordBatch` won't have any metadata attached if you just return it as-is.

I'm not sure whether there is another more elegant way to construct a `RecordBatch` with corresponding metadata from `ArrayData`. Right now I'm going through `StructArray` because the previous interface did that too. If there is another more elegant way, please let me know.

Other ways to attach metadata to an existing `RecordBatch` would be, as far as I can see, to call `with_schema()` (which will incur some "is subschema" test costs) or somehow go through `schema_metadata_mut()`, but that interface feels a bit clunky for this specific task IMHO.
> A `SAFETY` comment explaining why the `unsafe` is ok (aka how the invariants required for `RecordBatch::new_unchecked` are satisfied)
One reason for the `unsafe` here is that I did not want to introduce performance penalties compared to what the interface did before (it just returned a `RecordBatch` without checking whether it actually corresponds to the schema of `ArrowArrayStreamReader`; and the schemas actually mismatched before my change, at least metadata-wise).

In principle, the `Iterator` impl of `ArrowArrayStreamReader` returns `Result`s, so we can make this fallible through `RecordBatch::try_new(...)`. This would incur some costs though, such as checking each column for correct nullability, equal and correct row counts, type checks, etc.

I would have guessed that at least data-wise the interface can be trusted and therefore the checks can be omitted? 😅 I'm really not the expert here; I would have assumed that someone from the arrow-rs team could have an opinion here 😬
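For reference, a safe variant of this construction (just a sketch under the assumption that the stream's declared schema is at hand; not the code in this PR) could rebuild the batch and re-attach the metadata-carrying schema via `with_schema`:

```rust
use arrow::array::StructArray;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Rebuild a RecordBatch from FFI-produced data and re-attach the
/// metadata-carrying schema advertised by the stream. `with_schema` fails
/// if the batch's own fields are not compatible with the given schema.
fn batch_with_stream_schema(
    struct_array: StructArray,
    stream_schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    RecordBatch::from(struct_array).with_schema(stream_schema)
}
```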
> I'm not sure whether there is another more elegant way to construct a `RecordBatch` with corresponding metadata from `ArrayData`. Right now I'm going through `StructArray` because the previous interface did that too. If there is another more elegant way, please let me know.
What is current on main is pretty elegant:

```rust
RecordBatch::from(StructArray::from(data))
```

It is very important that this crate doesn't introduce unsoundness bugs, so in general we try and avoid `unsafe` unless there is a compelling justification, as described here:

https://github.com/apache/arrow-rs/blob/main/arrow/README.md#safety

So at the least this code should have a justification (as a comment) about why the `unsafe` is ok (aka why the invariants are known to be true) rather than using the safe alternative.

I suggest:
- Remove the use of `unsafe` in this PR
- Make a follow-on PR that proposes converting this to `unsafe`, where we can discuss the merits in a focused discussion
> What is current on main is pretty elegant:
>
> ```rust
> RecordBatch::from(StructArray::from(data))
> ```
I personally wouldn't call this elegant, as it's actually currently leading to unsoundness problems which I already stated a few times in this discussion 😬

1. Since `StructArray` is metadata-less, the `RecordBatch` constructed from it is metadata-less by definition too. This means the returned `RecordBatch` can't ever have the schema advertised by `ArrowArrayStreamReader`, at least if that schema is supposed to have metadata.
2. Much more importantly: there currently are no checks whether the `RecordBatch` constructed from `StructArray` and returned by the stream reader even corresponds to the schema which `ArrowArrayStreamReader` advertises. As far as I can see, they are returned as-is, and in principle it's possible to make this interface return a different `RecordBatch`. This was the cause of this PR, namely that it's currently returning unsound `RecordBatch`es without advertising this as an error somewhere.

The current state with `unsafe` in this PR more or less leaves problem 2 intact (arguably introducing a chance of an additional invariant incorrectness in the `RecordBatch` itself), but at least fixes problem 1.

Nevertheless, I will proceed with introducing schema checking now and remove `unsafe`, which would also fix problem 2 in general, at a tiny additional compute cost.
> Basically what I explained here (#8790 (comment)) => `StructArray` alone by definition is metadata-less, in turn leading to the problem that the resulting `RecordBatch` won't have any metadata attached if you just return it as-is.
>
> I'm not sure whether there is another more elegant way to construct a `RecordBatch` with corresponding metadata from `ArrayData`. Right now I'm going through `StructArray` because the previous interface did that too. If there is another more elegant way, please let me know.
I've complained about this before (though I can't find in what issue), and it's one of the reasons I document for why I created pyo3-arrow. It's currently impossible (I believe) in arrow-rs to persist extension metadata through the FFI interface.

I think we need a broader PR to handle this though; it shouldn't be shoehorned into this PR, which is focused on the pyarrow Table handling.
> it just returned `RecordBatch` without checking whether it's actually corresponding to the schema of `ArrowArrayStreamReader`
And I think it would be better to actually verify that the schema is correct and matches the declared schema of the stream. I'm pretty sure pyarrow checks the schema is correct.
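A sketch of what such a verification could look like at the point where the FFI data is turned into a `RecordBatch` (illustrative, not this PR's exact code; `declared` stands for the schema the stream advertised up front):

```rust
use arrow::array::{RecordBatch, StructArray};
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;

/// Convert FFI-produced data into a RecordBatch and verify it against the
/// schema the stream declared, instead of trusting the producer blindly.
fn checked_batch(struct_array: StructArray, declared: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let batch = RecordBatch::from(struct_array);
    if batch.schema().fields() != declared.fields() {
        return Err(ArrowError::SchemaError(format!(
            "stream produced a batch with fields {:?}, but declared schema has fields {:?}",
            batch.schema().fields(),
            declared.fields()
        )));
    }
    // Re-attach the declared schema so schema-level metadata is preserved.
    batch.with_schema(declared)
}
```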
```rust
/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
```
I am not familiar enough with how the python interface works to know if this is reasonable or not. Perhaps @kylebarron can help review this part
In this commit I also augmented the tests in `arrow-pyarrow-integration-tests`. Please let me know whether I did this right.
@alamb @kylebarron could I get a review on this PR again? 🥺 Functionally I would consider it complete. I'm just unsure whether the `ArrowArrayStreamReader` can always be trusted to produce RecordBatches with the correct schema, at least data/column-wise, as already mentioned in my comment here. This informs whether the batches can be constructed unchecked or need to go through the checked `RecordBatch::try_new(...)` path.
Thanks @jonded94 -- other than the use of unsafe this PR looks good to me. Thank you for your patience
arrow-array/src/ffi_stream.rs (outdated diff):
```diff
 let array = unsafe { from_ffi(ffi_array, &ffi_schema) }.unwrap();

-let record_batch = RecordBatch::from(StructArray::from(array));
+let record_batch = {
```
same here -- unless there is some compelling reason we shouldn't be using `unsafe`, even though this is test code
I've been at a conference or on vacation the last three weeks and am working through my backlog now. I'll try to review this today or tomorrow.
I think for now it would be better to remove the usage of `unsafe`.
Thanks for your review! Okay, then I'd propose to split this up, as described below.
Remove metadata tests
Done. This PR no longer changes `ffi_stream.rs`; that fix is split out into a separate PR. As soon as both PRs are merged, I'll submit another PR that augments the tests in `arrow-pyarrow-integration-tests` with the metadata checks again.
```diff
-class ArrayWrapper:
+class ArrayWrapper(ArrowArrayExportable):
     def __init__(self, array):
```
I don't think you actually have to subclass from the Protocol; the type checker will automatically check for structural type equality.
One doesn't have to subtype a `typing.Protocol`, yes. This is the whole idea behind it, i.e. structural typing rather than nominal typing, which allows consuming external objects that don't inherit directly from your classes, as long as they conform to a certain pattern.

But in cases where you have strong control over your own classes anyways, I find it highly beneficial to always inherit directly from the Protocol if possible. This has the advantage that it moves the detection of type mismatches to the place where you defined your class, instead of requiring you to make sure that you used all classes in business logic where objects conforming to a certain protocol are expected. Also, I have seen over the years that with very intricate Protocols, subtle type errors can sometimes be caught a bit more reliably by existing Python type checkers when directly inheriting from a Protocol, but that shouldn't really be relevant here I think.

Besides that, I sadly don't think there are any type checks actually running in the CI 😅 I think the only thing done here is compiling the Python package and running the tests. There probably should be another PR introducing some type checking with `mypy --strict` or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick look through this and it looks good to me. Thank you @jonded94 and @kylebarron

Rationale for this change
When dealing with Parquet files that have an exceedingly large amount of Binary or UTF8 data in one row group, there can be issues when returning a single `RecordBatch` because of index overflows (#7973).

In `pyarrow` this is usually solved by representing data as a `pyarrow.Table` object whose columns are `ChunkedArray`s, which basically are just lists of Arrow Arrays; alternatively, the `pyarrow.Table` is just a representation of a list of `RecordBatch`es.

I'd like to build a function in PyO3 that returns a `pyarrow.Table`, very similar to pyarrow's read_row_group method. With that, we could have feature parity with `pyarrow` in circumstances of potential index overflows without resorting to type changes (such as reading the data as `LargeString` or `StringView` columns).

Currently, as far as I see, there is no way in `arrow-pyarrow` to export a `pyarrow.Table` directly. Especially convenience methods from `Vec<RecordBatch>` seem to be missing. This PR tries to implement a convenience wrapper that allows directly exporting `pyarrow.Table`.

What changes are included in this PR?
A new struct `Table` in the crate `arrow-pyarrow` is added which can be constructed from `Vec<RecordBatch>` or from `ArrowArrayStreamReader`.

It implements `FromPyArrow` and `IntoPyArrow`. `FromPyArrow` will support anything that either implements the ArrowStream PyCapsule protocol, is a `RecordBatchReader`, or has a `to_reader()` method which returns one; `pyarrow.Table` does both of these things. `IntoPyArrow` will result in a `pyarrow.Table` on the Python side, constructed through `pyarrow.Table.from_batches(...)`.
Yes, in
arrow-pyarrow-integration-tests.Are there any user-facing changes?
A new
Tableconvience wrapper is added!