
Conversation

@jonded94
Contributor

@jonded94 jonded94 commented Nov 5, 2025

Rationale for this change

When dealing with Parquet files that have an exceedingly large amount of Binary or UTF8 data in one row group, there can be issues when returning a single RecordBatch because of index overflows (#7973).

In pyarrow this is usually solved by representing data as a pyarrow.Table object whose columns are ChunkedArrays, which are basically just lists of Arrow arrays; alternatively, the pyarrow.Table can be seen as a representation of a list of RecordBatches.

I'd like to build a function in PyO3 that returns a pyarrow.Table, very similar to pyarrow's read_row_group method. With that, we could have feature parity with pyarrow in circumstances of potential index overflows without resorting to type changes (such as reading the data as LargeString or StringView columns).
Currently, as far as I can see, there is no way in arrow-pyarrow to export a pyarrow.Table directly; in particular, convenience methods starting from Vec<RecordBatch> seem to be missing. This PR tries to implement a convenience wrapper that allows directly exporting a pyarrow.Table.

What changes are included in this PR?

A new struct Table in the crate arrow-pyarrow is added which can be constructed from Vec<RecordBatch> or from ArrowArrayStreamReader.
It implements FromPyArrow and IntoPyArrow.

FromPyArrow will support anything that either implements the ArrowStreamReader protocol or is a RecordBatchReader, or has a to_reader() method which does that. pyarrow.Table does both of these things.
IntoPyArrow will result in a pyarrow.Table on the Python side, constructed through pyarrow.Table.from_batches(...).
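
For illustration, here is a minimal sketch of how a PyO3 function using this wrapper could look on the Rust side (the function name, the Int32 example data, and the exact Table::try_new signature are assumptions for illustration, not necessarily the final API):

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_pyarrow::{PyArrowType, Table};
use arrow_schema::{DataType, Field, Schema};
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// Hypothetical example function: builds a Vec<RecordBatch> on the Rust side and
// hands it back to Python as a pyarrow.Table via the new Table wrapper.
#[pyfunction]
fn example_table() -> PyResult<PyArrowType<Table>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )
    .map_err(|e| PyValueError::new_err(e.to_string()))?;

    // Assumed constructor shape; the discussion below mentions a Table::try_new
    // that validates every batch against the given schema.
    let table = Table::try_new(vec![batch], schema)
        .map_err(|e| PyValueError::new_err(e.to_string()))?;

    // IntoPyArrow turns this into a pyarrow.Table via pyarrow.Table.from_batches(...).
    Ok(PyArrowType(table))
}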

Are these changes tested?

Yes, in arrow-pyarrow-integration-tests.

Are there any user-facing changes?

A new Table convenience wrapper is added!

Member

@kylebarron kylebarron left a comment


Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.

I don't know what the stance of maintainers is towards including a Table construct for python integration.

FWIW if you wanted to look at external crates, PyTable exists that probably does what you want. (disclosure it's my project). That alternatively might give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons)

@jonded94
Contributor Author

jonded94 commented Nov 6, 2025

Thanks @kylebarron for your very quick review! ❤️

Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.

I don't know what the stance of maintainers is towards including a Table construct for python integration.

Yes, I'm also not too sure about it; that's why I just sketched out a rough implementation without tests so far. A reason why I think this could potentially be nice to have in arrow-pyarrow is that the documentation even mentions that there is no equivalent concept to pyarrow.Table in arrow-pyarrow and that one has to do slight workarounds to use them:

PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't have these same concepts. A chunked table is instead represented with Vec<RecordBatch>. A pyarrow.Table can be imported to Rust by calling pyarrow.Table.to_reader() and then importing the reader as an ArrowArrayStreamReader.

At least I personally think having such a wrapper could be nice, since it simplifies stuff a bit when you anyways already have Vec<RecordBatch> on the Rust side somewhere or need to handle a pyarrow.Table on the Python side and want to have an easy method to generate such a thing from Rust. One still could mention in the documentation that generally, streaming approaches are highly preferred, and that the pyarrow.Table convenience wrapper shall only be used in cases where users know what they're doing.

Slightly nicer Python workflow

In our very specific example, we have a Python class with a function such as this one:

class ParquetFile:
  def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...

In the issue I linked, this unfortunately breaks down for a specific parquet file, since a particular row group isn't expressible as a single RecordBatch without changing types somewhere. Either you'd have to change the underlying Arrow types from String to LargeString or StringView, or you change the returned type from pyarrow.RecordBatch to, for example, Iterator[pyarrow.RecordBatch] (or RecordBatchReader or any other streaming-capable object).

The latter comes with some syntactic shortcomings in contexts where you want to apply .to_pylist() on whatever read_row_group(...) returns:

rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]]
if isinstance(rg, pyarrow.RecordBatch):
  python_objs = rg.to_pylist()
else:
  python_objs = list(itertools.chain.from_iterable(batch.to_pylist() for batch in rg))

With pyarrow.Table, there already exists a thing which simplifies this a lot on the Python side:

rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]] = rg.to_pylist()

And just for clarity, we unfortunately need to have the entire row group deserialized as Python objects because our data ingestion pipelines that consume this are expecting to have access to the entire row group in bulk, so streaming approaches are sadly not usable.

FWIW if you wanted to look at external crates, PyTable exists that probably does what you want. (disclosure it's my project). That alternatively might give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons)

Yes, in general, I much prefer the approach of arro3 to be totally pyarrow agnostic. In our case unfortunately, we're right now still pretty hardcoded against pyarrow specifics and just use arrow-rs as a means to reduce memory load compared to reading & writing parquet datasets with pyarrow directly.

@kylebarron
Member

one has to do slight workarounds to use them:

I think that's outdated for Python -> Rust. I haven't tried but you should be able to pass a pyarrow.Table directly into an ArrowArrayStreamReader on the Rust side, because it just looks for the __arrow_c_stream__ method that exists either on the Table or the pyarrow.RecordBatchReader.
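
A minimal sketch of that Python -> Rust path, assuming the behavior described above (the function name is hypothetical):

use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_array::RecordBatch;
use arrow_pyarrow::PyArrowType;
use arrow_schema::ArrowError;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// Hypothetical function: accepts anything exposing __arrow_c_stream__, so a
// pyarrow.Table (or a pyarrow.RecordBatchReader) can be passed in directly.
#[pyfunction]
fn total_rows(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    // ArrowArrayStreamReader implements Iterator<Item = Result<RecordBatch, ArrowError>>.
    let batches: Result<Vec<RecordBatch>, ArrowError> = reader.0.collect();
    let batches = batches.map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(batches.iter().map(|b| b.num_rows()).sum())
}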

But I assume there's no way today to easily return a Table from Rust to Python.

At least I personally think having such a wrapper could be nice, since it simplifies stuff a bit when you anyways already have Vec<RecordBatch> on the Rust side somewhere or need to handle a pyarrow.Table on the Python side and want to have an easy method to generate such a thing from Rust.

I'm fine with that; and I think other maintainers would probably be fine with that too, since it's only a concept that exists in the Python integration.

I'm not sure I totally get your example. Seems bad to be returning a union of multiple types to Python. But seems reasonable to return a Table there. The alternative is to return a stream and have the user either iterate over it lazily or choose to materialize it with pa.table(ParquetFile.read_row_group(...)).

And just for clarity, we unfortunately need to have the entire row group deserialized as Python objects because our data ingestion pipelines that consume this are expecting to have access to the entire row group in bulk, so streaming approaches are sadly not usable.

Well there's nothing stopping you from materializing the stream by passing it to pa.table(). You don't have to use the stream as a stream.

Yes, in general, I much prefer the approach of arro3 to be totally pyarrow agnostic. In our case unfortunately, we're right now still pretty hardcoded against pyarrow specifics and just use arrow-rs as a means to reduce memory load compared to reading & writing parquet datasets with pyarrow directly.

You can use pyo3-arrow with pyarrow as well, but I'm not opposed to adding this functionality to arrow-rs as well.

@jonded94
Contributor Author

jonded94 commented Nov 6, 2025

I think that's outdated for Python -> Rust. I haven't tried but you should be able to pass a pyarrow.Table directly into an ArrowArrayStreamReader on the Rust side

Yes, exactly, that's what I even mentioned here in this PR (https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR491-R492 + https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR545-R549):

/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead)

This is even used to convert pyarrow.Table to ArrowArrayStreamReader and eventually to Table: https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR544 + https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR567

As you said, the opposite, namely easily returning a Vec<RecordBatch> as a pyarrow.Table to Python is what's really missing here and what this PR mainly is about.

I'm not sure I totally get your example. Seems bad to be returning a union of multiple types to Python.

My example wasn't entirely complete, for simplicity (and still isn't); it would be more something like this:

class ParquetFile:
  @overload
  def read_row_group(self, index: int, as_table: Literal[True]) -> pyarrow.Table: ...
  @overload
  def read_row_group(self, index: int, as_table: Literal[False] = ...) -> pyarrow.RecordBatch: ...
  def read_row_group(self, index: int, as_table: bool = False) -> pyarrow.RecordBatch | pyarrow.Table: ...

The advantage of that would be that both pyarrow.RecordBatch and pyarrow.Table implement .to_pylist() -> list[dict[str, Any]]. This is the important bit here, as we later just want to be able to call to_pylist() on whatever singular object read_row_group(...) returns and be guaranteed that the entire row group is deserialized as Python objects in this list. So it also could be expressed in our very specific example as:

class ToListCapable(Protocol):
  def to_pylist(self) -> list[dict[str, Any]]: ...

class ParquetFile:
  def read_row_group(self, index: int, as_table: bool = False) -> ToListCapable: ...

The alternative is to return a stream and have the user either iterate over it lazily or choose to materialize it with pa.table(ParquetFile.read_row_group(...)).

&

Well there's nothing stopping you from materializing the stream by passing it to pa.table(). You don't have to use the stream as a stream.

Yes, sure! We also do that in other places, or have entirely streamable pipelines elsewhere that use the PyCapsule ArrowStream interface. It's just that for this very specific use case, a Vec<RecordBatch> -> pyarrow.Table convenience wrapper perfectly maps to what we need with no required changes in any consuming code, and I would be interested in whether maintainers of arrow-pyarrow find that useful for similar very specific niche use cases, as I said.

@alamb
Contributor

alamb commented Nov 6, 2025

I am not a Python expert nor have I fully understood all the discussion on this ticket, but:

At least I personally think having such a wrapper could be nice, since it simplifies stuff a bit when you anyways already have Vec<RecordBatch> on the Rust side somewhere or need to handle a pyarrow.Table on the Python side and want to have an easy method to generate such a thing from Rust. One still could mention in the documentation that generally, streaming approaches are highly preferred, and that the pyarrow.Table convenience wrapper shall only be used in cases where users know what they're doing.

This would be my preferred approach -- make it easy to go from Rust <> Python, while trying to encourage good practices (e.g. streaming). There is no reason to be pedantic and make someone jump through hoops to build a PyTable if that is what they want.

@jonded94
Contributor Author

jonded94 commented Nov 7, 2025

I now added a bunch of tests in arrow-pyarrow-integration-testing and overhauled Table to allow exporting empty pyarrow.Tables. I also edited the arrow-pyarrow crate docstring to signal that there now exists a pyarrow.Table equivalent, but that streaming approaches are preferred in general.

@jonded94 jonded94 force-pushed the implement-pyarrow-table-convenience-class branch 2 times, most recently from 61464b5 to 37b46be on November 7, 2025 14:52
Comment on lines 59 to 60
//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
Member

I think it would be good to note here that another advantage of using ArrowArrayStreamReader is that it works with tables and stream input out of the box. It doesn't matter which type the user passes in.

Member

Well, actually, a slight correction: assuming PyCapsule Interface input, both Table and ArrowArrayStreamReader will work with both table and stream input out of the box; the difference is just whether the Rust code materializes the data.

This is why I have this table in the pyo3-arrow docs:

[image: comparison table from the pyo3-arrow documentation]

Member

Also reading through the docs again, I'd suggest making a reference to Box<dyn RecordBatchReader> rather than ArrowArrayStreamReader. The former is a higher level API and much easier to use.

Contributor Author

I think it would be good to note here that another advantage of using ArrowArrayStreamReader is that it works with tables and stream input out of the box.

I added that in the docs.

Also reading through the docs again, I'd suggest making a reference to Box<dyn RecordBatchReader> rather than ArrowArrayStreamReader. The former is a higher level API and much easier to use.

I'm not exactly sure what you mean here. Box<dyn RecordBatchReader> only implements IntoPyArrow, but not FromPyArrow. So in the example I give in the new documentation, which states that a streaming approach could also be used for consuming a pyarrow.Table in Rust, Box<dyn RecordBatchReader> sadly doesn't help. One has to use ArrowArrayStreamReader, since that properly implements FromPyArrow (see the sketch below).
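
As a minimal sketch of that asymmetry (the function name is hypothetical): accept the stream as ArrowArrayStreamReader, and return it as a Box<dyn RecordBatchReader + Send>, which shows up as a pyarrow.RecordBatchReader on the Python side.

use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_array::{RecordBatch, RecordBatchIterator, RecordBatchReader};
use arrow_pyarrow::PyArrowType;
use arrow_schema::ArrowError;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// Hypothetical round-trip: ArrowArrayStreamReader implements FromPyArrow (input side),
// while Box<dyn RecordBatchReader + Send> implements IntoPyArrow (output side).
#[pyfunction]
fn roundtrip_stream(
    reader: PyArrowType<ArrowArrayStreamReader>,
) -> PyResult<PyArrowType<Box<dyn RecordBatchReader + Send>>> {
    let reader = reader.0;
    let schema = reader.schema();
    let batches: Result<Vec<RecordBatch>, ArrowError> = reader.collect();
    let batches = batches.map_err(|e| PyValueError::new_err(e.to_string()))?;
    // Materialized batches are wrapped back into a reader for export.
    let out: Box<dyn RecordBatchReader + Send> =
        Box::new(RecordBatchIterator::new(batches.into_iter().map(Ok), schema));
    Ok(PyArrowType(out))
}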

@jonded94
Contributor Author

jonded94 commented Nov 7, 2025

Hey @kylebarron, thanks for the review! I implemented everything as you suggested.

As you can see, the CI is now broken because of a subtle problem that was uncovered. Maybe you can help me, as I'm not too familiar with the FFI logic and none of my sanity checks helped me:

In the test_table_roundtrip Python test in arrow-pyarrow-integration-test, we're simply handing a pyarrow.Table to Rust and letting it roundtrip through the conversion layers back to a pyarrow.Table. Unfortunately, the conversion from PyArrowType<Table> -> ArrowArrayStreamReader -> Box<dyn RecordBatchReader> -> Table is now failing, specifically the last part (Box<dyn RecordBatchReader> -> Table).

The RecordBatches that are read from Box<dyn RecordBatchReader> in the try_new function of Table seem to be metadata-less. This leads to an error, because the try_new function validates that the schema of all record batches corresponds to the explicitly given schema.

The schema itself from the Box<dyn RecordBatchReader> still has the metadata {"key1": "value1"} attached, but not the individual RecordBatches. I left a somewhat verbose error message in the Rust error:

ValueError: Schema error: All record batches must have the same schema. Expected schema: Schema { fields: [Field { name: "ints", data_type: List(Field { data_type: Int32, nullable: true }), nullable: true }], metadata: {"key1": "value1"} }, got schema: Schema { fields: [Field { name: "ints", data_type: List(Field { data_type: Int32, nullable: true }), nullable: true }], metadata: {} 

This previously worked because I used an unsafe interface for building a Table before which didn't check for schema validity.

Sanity checks:

  • All other roundtrips work without problems; the metadata seems to be handed through all layers otherwise
  • In the failing Python test test_table_roundtrip, I asserted that the pyarrow.Table definitely still has the metadata attached, and so do all RecordBatches obtained from it
    • Only in the conversion to Rust RecordBatches through the Box<dyn RecordBatchReader> do they somehow seem to lose their metadata. Is this something which the FFI interface doesn't guarantee?

EDIT: More importantly, omitting the schema check in Table::try_new actually lets the test test_table_roundtrip succeed. I saw that in pyo3-arrow, you don't check schema equality with schema == record_batch.schema(), but have a custom function schema_equals which seems to be a little bit more forgiving? Shall we use something similar in Table::try_new?

@jonded94
Contributor Author

jonded94 commented Nov 7, 2025

@kylebarron for now I stole the schema_equals function from pyo3_arrow, and everything seems to work again. I don't quite understand why the RecordBatches are metadata-less when read from Box<dyn RecordBatchReader> though.

for record_batch in &record_batches {
    if !schema_equals(&schema, &record_batch.schema()) {
        return Err(ArrowError::SchemaError(
            //"All record batches must have the same schema.".to_owned(),
Member

Suggested change
//"All record batches must have the same schema.".to_owned(),

Contributor Author

I only have the more verbose error message here right now to understand what's going on in the schema mismatch. This is currently commented out to signal that this is not intended to be merged as-is, but the schema mismatch issue shall be understood first.

In general I don't have a strong opinion about how verbose the error message should be; I'll happily remove whatever variant you dislike eventually.

@jonded94 jonded94 force-pushed the implement-pyarrow-table-convenience-class branch from 94b3cf3 to 52047b8 on November 15, 2025 13:43
@jonded94
Contributor Author

Hey, I pushed a version that does not use the PyCapsule ArrayStream interface for converting a pyarrow.Table to Table if the given Python object has a to_batches() method (which pyarrow.Table does). This is not necessarily intended to stay that way, but it is helpful for diagnosing where RecordBatch metadata is dropped.

pyarrow.Table.to_batches() returns a list[pyarrow.RecordBatch] which I explicitly convert to Vec<RecordBatch> in the from_pyarrow_bound function of impl FromPyArrow for Table. This basically is the equivalent of what I'm doing in the corresponding impl IntoPyArrow for Table, as I'm not using the PyCapsule interface there, but just immediately construct a pyarrow.Table out of Vec<RecordBatch> through pyarrow.Table.from_batches(...).

With that, I got RecordBatches with preserved metadata from pyarrow.Table, in turn allowing me to drop the schema_equals function and instead do a full schema == record_batch.schema() check.

Since I also checked on the Python side, with a pyarrow.RecordBatchReader.from_stream of a StreamWrapper around a pyarrow.Table, that RecordBatches coming from an ArrayStream PyCapsule interface of a pyarrow.Table definitely still have their metadata, the error has to be on the Rust side somewhere in the Box<dyn RecordBatchReader> / impl FromPyArrow for ArrowArrayStreamReader path.

Potentially there is a slight misuse of the PyCapsule interface somewhere, as this definitely seems to return RecordBatches without metadata. I'm not too familiar with the low-level stuff there, but I'll try to investigate; help is appreciated!

@github-actions github-actions bot added the arrow (Changes to the arrow crate) label Nov 15, 2025
@jonded94
Contributor Author

jonded94 commented Nov 15, 2025

Okay, I think the error was in impl Iterator for ArrowArrayStreamReader. That actually did not correctly expose RecordBatches with metadata attached, but was just directly constructing RecordBatch from StructArray (which is metadata-less by definition).

With that, consuming pyarrow.Table through the PyCapsule interface (and not with this custom Table.to_batches() method) seems to work now, which means I could remove that again.

@alamb @kylebarron let me know whether the change of impl Iterator for ArrowArrayStreamReader in arrow-array/src/ffi_stream.rs is okay.

Contributor

@alamb alamb left a comment

Thanks @jonded94

};
Some(result.map(|data| RecordBatch::from(StructArray::from(data))))
Some(result.map(|data| {
let struct_array = StructArray::from(data);
Contributor

Can you please leave a comment here explaining:

  1. The rationale for this form rather than just converting StructArray to RecordBatch
  2. A SAFETY comment explaining why the unsafe is ok (aka how the invariants required for RecordBatch::new_unchecked are satisfied)

Contributor Author

The rationale for this form rather than just converting StructArray to RecordBatch

Basically what I explained here (#8790 (comment)) => StructArray alone by definition is metadata-less, in turn leading to the problem that the resulting RecordBatch won't have any metadata attached if you just return it as-is.

I'm not sure whether there is another more elegant way to construct a RecordBatch with corresponding metadata from ArrayData. Right now I'm going through StructArray because the previous interface did that too. If there is another more elegant way, please let me know.

Other ways to attach metadata to an existing RecordBatch would be, as far as I can see, to call with_schema() (which will incur some "is subschema test" costs) or somehow through schema_metadata_mut(), but the interface feels a bit clunky for this specific task IMHO.

A SAFETY comment explaining why the unsafe is ok (aka how the invariants required for RecordBatch::new_unchecked are satisfied)

One reason for the unsafe here is that I did not want to introduce performance penalties in comparison to what the interface did before (it just returned RecordBatch without checking whether it actually corresponds to the schema of ArrowArrayStreamReader; and the schemas actually mismatched before my change, at least metadata-wise).

In principle, Iterator of ArrowArrayStreamReader returns Result, so we can make this fallible through RecordBatch::try_new(...). This would incur some costs though, such as checking each column for correct nullability, equal and correct row count, type checks, etc.

I would have guessed that, at least data-wise, the interface can be trusted and therefore the checks can be omitted? 😅 I'm really not the expert here; I would have assumed that someone from the arrow-rs team has an opinion on this 😬

Contributor

I'm not sure whether there is another more elegant way to construct a RecordBatch with corresponding metadata from ArrayData. Right now I'm going through StructArray because the previous interface did that too. If there is another more elegant way, please let me know.

What is current on main is pretty elegant

RecordBatch::from(StructArray::from(data))

It is very important that this crate doesn't introduce unsoundness bugs, so in general we try and avoid unsafe unless there is a compelling justification, as described here
https://github.com/apache/arrow-rs/blob/main/arrow/README.md#safety

So at the least this code should have a justification (as a comment) about why unsafe is ok (aka why the invariants are known to be true) rather than using the safe alternate

I suggest:

  1. Remove the use of unsafe in this PR
  2. Make a follow on PR that proposes converting this to unsafe where we can discuss the merits in a focused discussion

Contributor Author

@jonded94 jonded94 Dec 2, 2025

What is current on main is pretty elegant

RecordBatch::from(StructArray::from(data))

I personally wouldn't call this elegant, as it's actually currently leading to unsoundness problems which I already stated a few times in this discussion 😬

  1. Since StructArray is metadata-less, the RecordBatch constructed from it is metadata-less by definition too. This means the returned RecordBatch can't ever have the schema advertised by ArrowArrayStreamReader, at least if that schema is supposed to carry metadata.
  2. Much more importantly: there currently are no checks whether the RecordBatch constructed from StructArray and returned by the stream reader even corresponds to the schema which ArrowArrayStreamReader advertises. As far as I can see they are returned as-is, and in principle it's possible to make this interface return a different RecordBatch. This was the cause of the issue in this PR, namely that it currently returns unsound RecordBatches without surfacing this as an error anywhere.

The current state with unsafe in this PR more or less leaves problem 2 intact (with arguably introducing a chance of an additional invariant incorrectness in the RecordBatch itself), but at least fixes problem 1.

Nevertheless, I will proceed with introducing schema checking now and remove unsafe, which would also fix problem 2 in general, with a tiny additional compute cost.
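
For illustration, one safe shape this could take (a sketch only; the helper name is assumed and this is not necessarily what was merged): re-attach the stream's declared schema, with its metadata, through the checked constructor instead of unsafe.

use arrow_array::{Array, RecordBatch, RecordBatchOptions, StructArray};
use arrow_schema::{ArrowError, SchemaRef};

// Hypothetical helper: rebuilds the batch against the stream's declared schema, so
// schema metadata is preserved and the columns are validated against that schema.
fn batch_with_stream_schema(
    struct_array: StructArray,
    schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    let row_count = struct_array.len();
    let (_fields, columns, _nulls) = struct_array.into_parts();
    // try_new_with_options validates column types and lengths against `schema`,
    // and the resulting batch carries the schema's metadata.
    RecordBatch::try_new_with_options(
        schema,
        columns,
        &RecordBatchOptions::new().with_row_count(Some(row_count)),
    )
}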

Member

Basically what I explained here (#8790 (comment)) => StructArray alone by definition is metadata-less, in turn leading to the problem that the resulting RecordBatch won't have any metadata attached if you just return it as-is.

I'm not sure whether there is another more elegant way to construct a RecordBatch with corresponding metadata from ArrayData. Right now I'm going through StructArray because the previous interface did that too. If there is another more elegant way, please let me know.

I've complained about this before (though I can't find in what issue), and it's one of the documented reasons why I created pyo3-arrow. It's currently impossible (I believe) in arrow-rs to persist extension metadata through the FFI interface.

I think we need a broader PR to handle this though; it shouldn't be shoehorned into this PR that is focused on the PyArrow Table handling

Member

it just returned RecordBatch without checking whether it actually corresponds to the schema of ArrowArrayStreamReader

And I think it would be better to actually verify that the schema is correct and matches the declared schema of the stream. I'm pretty sure pyarrow checks the schema is correct.

}
}

/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
Contributor

I am not familiar enough with how the python interface works to know if this is reasonable or not. Perhaps @kylebarron can help review this part

@jonded94
Contributor Author

In this commit I also augmented the tests in ffi_stream.rs to actually check for exported/imported metadata, instead of dealing with a metadata-less schema.

Please let me know whether I did this right.

@jonded94
Contributor Author

@alamb @kylebarron could I get a review on this PR again? 🥺

Functionally I would consider it complete. I'm just unsure whether the ArrowArrayStreamReader can always be trusted to produce RecordBatches with the correct schema, at least data/column-wise, as already mentioned in my comment here.

This informs whether the unsafe logic I do there is justified, or whether a potentially somewhat costly check on schema validity for every RecordBatch would have to be introduced (which we didn't do before).

Contributor

@alamb alamb left a comment

Thanks @jonded94 -- other than the use of unsafe this PR looks good to me. Thank you for your patience


let array = unsafe { from_ffi(ffi_array, &ffi_schema) }.unwrap();

let record_batch = RecordBatch::from(StructArray::from(array));
let record_batch = {
Contributor

same here -- unless there is some compelling reason we shouldn't be using unsafe, even though this is test code

@kylebarron
Member

I've been at a conference or on vacation the last three weeks and am working through my backlog now. I'll try to review this today or tomorrow.

@kylebarron
Member

I think for now it would be better to remove usage of unsafe in this PR, even if that means it drops the metadata on each record batch, and then we can discuss that further in a new issue.

@jonded94
Contributor Author

jonded94 commented Dec 3, 2025

I think for now it would be better to remove usage of unsafe in this PR, even if that means it drops the metadata on each record batch, and then we can discuss that further in a new issue.

Thanks for your review! Okay, then I'd propose to:

  • Reduce the scope of this PR to just implementing the pyarrow.Table wrapper, and let it be somewhat broken in the context of schemas with metadata for now
  • Open another PR just fixing metadata issues in ArrowArrayStreamReader, and expanding the tests for the pyarrow.Table wrapper with metadata checking
  • Potentially discuss in another issue whether schema checking inside ArrowArrayStreamReader is required or whether unsafe can be used there

@jonded94 jonded94 force-pushed the implement-pyarrow-table-convenience-class branch from 7cf6f70 to f04a0d6 on December 3, 2025 13:58
@github-actions github-actions bot removed the arrow (Changes to the arrow crate) label Dec 3, 2025
@jonded94
Contributor Author

jonded94 commented Dec 3, 2025

Done, this PR is free of unsafe and only adds the pyarrow.Table convenience wrapper.

The fix for the ArrowArrayStreamReader can be found in this PR: #8944

As soon as both PRs are merged, I'll submit another PR that augments the tests in arrow-pyarrow-integration-tests to include more in-depth metadata checks.


class ArrayWrapper:
def __init__(self, array):
class ArrayWrapper(ArrowArrayExportable):
Member

I don't think you actually have to subclass from the protocol; the type checker will automatically check for structural type equality.

Contributor Author

One doesn't have to subtype a typing.Protocol, yes (this is the whole idea behind it, i.e. structural rather than nominal typing, which allows consuming external objects that don't have to inherit directly from your classes as long as they conform to a certain pattern).

But in cases where you have strong control over your own classes anyway, I find it highly beneficial to always inherit directly from the Protocol if possible. This has the advantage that it moves the detection of type mismatches to the place where you defined your class, instead of requiring you to make sure you used all classes in the business logic where objects conforming to a certain protocol are expected. Also, I've seen over the years that with very intricate Protocols, subtle type errors can sometimes be caught a bit more reliably by existing Python type checkers when directly inheriting from a Protocol, but that shouldn't really be relevant here I think.

Besides that, I sadly don't think there are any type checks actually running in the CI 😅 I think the only thing done here is to compile the Python package and run the tests. There probably should be another PR introducing some type checking with mypy --strict or similar.

Contributor

@alamb alamb left a comment

I took a quick look through this and it looks good to me. Thank you @jonded94 and @kylebarron

@alamb alamb merged commit a67cd19 into apache:main Dec 5, 2025
13 checks passed