PDEP-15: Reject PDEP-10 #58623
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,48 @@ | ||
# PDEP-10: PyArrow as a required dependency for default string inference implementation | ||
|
||
- Created: 17 April 2023 | ||
- Status: Accepted | ||
- Created: 17 April 2023 (updated May 8, 2024) | ||
- Status: Rejected | ||
- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711) | ||
[#52509](https://github.com/pandas-dev/pandas/issues/52509) | ||
- Author: [Matthew Roeschke](https://github.com/mroeschke) | ||
[Patrick Hoefler](https://github.com/phofl) | ||
- Revision: 1 | ||
- Revision: 2 | ||
|
||
# Note | ||
|
||
This PDEP was originally accepted on May 8, 2023. However, after reviewing feedback posted | ||
on the feedback issue [#54466](https://github.com/pandas-dev/pandas/issues/54466), we, the members of | ||
the core team, have decided not to move forward with this PDEP for pandas 3.0. | ||
|
||
|
||
The primary reasons for rejecting this PDEP are twofold: | ||
|
||
1) Requiring pyarrow as a dependency causes installation problems. | ||
- Pyarrow does not fit or has a hard time fitting in space-constrained environments | ||
I think what we could learn from this process is what caused this to change our minds? These issues were discussed leading up to the acceptance of PDEP-10. The way this is written I think reads more as "we discovered this after the fact" instead of "we decided that X amount of negative feedback on these points was enough to revert". I think there is some value to future PDEPs to set expectations around the latter

Within the context of recent conversation I don't think this comment about AWS is true. AWS distributes an official pandas image for lambda which already includes pyarrow, pandas, and NumPy. This is all required by their own "AWS SDK on pandas" library. The issue more finely scoped I think is that the default wheel installation via pip into a lambda image exceeds the 256 MB limit. Either using the official AWS provided image or using miniconda should not exceed the space limits |
||
such as AWS Lambda and WASM, due to its large size of roughly 40 MB for a compiled wheel | ||
(which is larger than pandas' own wheel sizes) | ||
- Installation of pyarrow is not possible on some platforms. We provide support for some | ||
less widely used platforms such as Alpine Linux (and there is third party support for pandas in | ||
Pyodide, a WASM distribution of Python), both of which pyarrow does not provide wheels for. | ||
|
||
While both of these reasons are mentioned in the drawbacks section of this PDEP, at the time of writing | ||
we underestimated the impact this would have on users and downstream developers. | ||
|
||
2) Many of the benefits presented in this PDEP can be realized even with pyarrow as an optional dependency. | ||
I personally don't find this point very convincing. Saying
On the larger roadmap of pandas this moves us away from tighter Arrow integration, which means we move further away from Arrow compute algorithms / joins and the larger ecosystem of tools that includes streaming, query optimizers, planners, data engines, etc... I think this argument in its current form is saying "we don't need a car because we have a horse and buggy"

In PDEP-10, there are 3 benefits listed
So, IMO, this argument is accurate, in that most of the benefits in PDEP-10 can be made possible (for those users who have pyarrow installed) without making pyarrow required. The future benefits of Arrow are very compelling, but decisions on making a dependency required should be based on immediate and not future benefits. Like I said before, it is easy to reconsider this decision in a year's time if those future benefits materialize.

If you think points 1 and 3 are possible without pyarrow then the alternatives for that should be laid out in this PDEP, at least at a super high level. I'm assuming point 1 refers to the nanoarrow POC I was sharing; point 3 requires reimplementing the conversions that pyarrow already has. (I personally don't think building either of those from scratch is a good long term solution but it can at least be discussed) For point 2 how do you know those are niche applications? It's easy to dismiss things that don't exist today as not worthwhile, but I get the feeling that there could be plenty of use cases for the aggregate types, since they have a natural fit with many of the Python containers. On interoperability the long term prospects for the dataframe interchange protocol seem dubious, and we have even discussed moving that out of pandas (see #56732).
The Arrow interchange protocol can be used by any library that needs to work with Arrow data - there is no limit to it being used by other dataframe libraries. It provides a standardized API so that third parties don't need to hack into our internals, which is a direct benefit for us. It also works in two directions - we can be a consumer just as much as a producer.

Also wanted to point out that arrow has a decimal128 and decimal256 type which is especially useful for financial calculations where floating point inaccuracies cannot be tolerated, and the arrow decimal types are an extremely significant improvement over using object.

Sure, will update and add a note in the PDEP when I get time again. |
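
To make the floating point concern in the decimal comment concrete, here is a minimal sketch. It uses Python's standard-library `decimal` module as a stand-in for Arrow's decimal types (an assumption made so the snippet runs without pyarrow installed); it shows only the rounding problem, not Arrow's storage or performance characteristics.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum carries a small rounding error.
float_total = 0.1 + 0.2

# Decimal values built from strings keep the exact base-10 digits.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```

Holding Python `Decimal` objects in an object column gives the same exactness but stores one Python object per value, which is the overhead the comment contrasts with a native decimal type.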
||
|
||
For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics | ||
PDEP 14 does not change performance or memory savings if you do not have pyarrow installed

added a note in parentheses at the end of that sentence.

Did you push this up? I don't see anything in parentheses. The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?

I updated the PDEP-15 text, and forgot to remove the PDEP-10 changes. I've removed the PDEP-10 changes now. |
||
as our current default object string data type, but that allows users to experience faster performance and memory savings | ||
compared to the object strings. | ||
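
The optional-dependency idea discussed here can be sketched roughly as follows. This is an illustration only, not pandas' actual implementation: `pyarrow_available` and `pick_string_storage` are hypothetical helper names, and the `"pyarrow"` / `"python"` labels are assumed purely for the example.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Return True if pyarrow can be found in the current environment."""
    return importlib.util.find_spec("pyarrow") is not None


def pick_string_storage(have_pyarrow: bool) -> str:
    """Hypothetical helper: choose the backing storage for a string dtype."""
    # Prefer the Arrow-backed storage when pyarrow is installed; otherwise
    # fall back to Python string objects that expose the same semantics.
    return "pyarrow" if have_pyarrow else "python"


storage = pick_string_storage(pyarrow_available())
print(f"string storage: {storage}")
```

With this shape, users without pyarrow still get the same behaviour, while any performance or memory gains apply only when pyarrow is actually present.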
|
||
While we've decided to not move forward with requiring pyarrow in pandas 3.0, the rejection of this PDEP | ||
does not mean that we are abandoning pyarrow support and integration in pandas. We, as the core team, still believe | ||
that adopting support for pyarrow arrays and data types in more of pandas will lead to greater interoperability with the | ||
ecosystem and better performance for users. Furthermore, a lot of the drawbacks, such as the large installation size of pyarrow | ||
and the lack of support for certain platforms, can be solved, and potential solutions have been proposed for them, allowing us | ||
to potentially revisit this decision in the future. | ||
|
||
However, at this point in time, it is clear that we are not ready to require pyarrow | ||
as a dependency in pandas. | ||
|
||
|
||
## Abstract | ||
|
||
|
@@ -210,6 +246,7 @@ before releasing a new pandas version. | |
|
||
- 17 April 2023: Initial version | ||
- 8 May 2023: Changed proposal to make pyarrow required in pandas 3.0 instead of 2.1 | ||
- 8 May 2024: Changed status to rejected | ||
|
||
[^1] <https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability> | ||
[^2] <https://arrow.apache.org/powered_by/> |