Bump to PyArrow 17.0.0 #929
The vote has passed: https://lists.apache.org/thread/mnzdpwzhctx6yrjl16zn8hl7pcxxt575
Amazing. Given that this issue (#936) also requires 17.0.0 for the fix, maybe it's right for us to move forward with onboarding
@syun64 It has been released, and I've updated the lockfile
Awesome :) we live in exciting times 🎉
It looks like the 3.9 artifacts are missing:
The missing wheels and source distribution for pyarrow 17.0.0 have been uploaded to PyPI. Sorry for the inconvenience.
@raulcd No problem, thanks for the heads up here 👍
Force-pushed from e972de5 to a396149
@syun64 @HonahX @kevinjqliu This provides a nice cleanup of the types (and probably also a speed-up); the downside is that we have to raise the lower bound to PyArrow 17. PTAL
  pa.field(
      "address",
      pa.struct([
-         pa.field("street", pa.large_string()),
-         pa.field("city", pa.large_string()),
+         pa.field("street", pa.string()),
Total outsider here, but curious: was there a bug in PyArrow that made those large_string instead of string?
@raulcd - It wasn't a bug, but actually an intentional change for the time being. If we update to PyArrow 17.0.0 we will be able to revert that change, and let the encoding in the parquet file dictate whether the table should be read as a large or small type for the Table API.
btw, are dependabot PRs automatically merged? It seems it updated pyarrow (4282d2f)
Great question @Fokko ... after thinking a lot about this the past week, here's my long answer, organized by different topics of consideration.
Benefits of 17.0.0
User's ability to use PyIceberg in applications
I'm of the impression that while this change seems to make sense from the perspective of preserving type or encoding correctness, it will actually result in a performance regression due to the fact that we will be reading most batches as small types, but having to cast them to large types (infrequently for pa.Table, but always for pa.RecordBatchReader). Another option is to always choose to cast to a small type instead in
Based on these points, I'm leaning towards not aggressively increasing the lower bound to 17.0.0, at least for this minor release, but I'm very excited to hear what others think as well!
@syun64 already pointed out the costs and benefits of upgrading. I lean more towards correctness than performance. What is the correctness issue if we do not upgrade? As I understand from the above, if the parquet file is of type
As for updating the minimum dependency to pyarrow 17.0.0, I would prefer to let the new Arrow version bake for a while before we require all new versions of PyIceberg to use it. I also think the 0.7.0 release's feature set is getting massive. We can add this upgrade as a fast-follow release.
Force-pushed from 9969926 to 921cd84
Force-pushed from 921cd84 to 73b8965