DOC: add section about upcoming pandas 3.0 changes (string dtype, CoW) to 2.3 whatsnew notes #61795

98 changes: 98 additions & 0 deletions doc/source/whatsnew/v2.3.0.rst
@@ -10,6 +10,104 @@ including other versions of pandas.

.. ---------------------------------------------------------------------------

.. _whatsnew_230.upcoming_changes:

Upcoming changes in pandas 3.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pandas 3.0 will bring two bigger changes to the default behavior of pandas.

Dedicated string data type by default
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Historically, pandas represented string columns with NumPy ``object`` data type.
This representation has numerous problems: it is not specific to strings (any
Python object can be stored in an ``object``-dtype array, not just strings) and
it is often inefficient, both in performance and in memory usage.

Starting with the upcoming pandas 3.0 release, a dedicated string data type will
be enabled by default (backed by PyArrow under the hood, if installed, otherwise
falling back to NumPy). This means that pandas will start inferring columns
containing string data as the new ``str`` data type when creating pandas
objects, such as in constructors or IO functions.

Old behavior:

.. code-block:: python

    >>> ser = pd.Series(["a", "b"])
    >>> ser
    0    a
    1    b
    dtype: object

Review comment (Member): does this need another line to actually output the repr, or should these just be changed to not include ``ser =``?

New behavior:

.. code-block:: python

    >>> ser = pd.Series(["a", "b"])
    >>> ser
    0    a
    1    b
    dtype: str

The string data type that is used in these scenarios will mostly behave as NumPy
object would, including missing value semantics and general operations on these
columns.

However, the introduction of a new default dtype can also break existing code
(for example, code that checks whether the ``.dtype`` is object dtype). To allow
testing this ahead of the pandas 3.0 release, the future dtype inference logic
can be enabled in pandas 2.3 with:

.. code-block:: python

    pd.options.future.infer_string = True

See the :ref:`string_migration_guide` for more details on the behaviour changes
and how to adapt your code to the new default.
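
As a minimal sketch of the kind of dtype check mentioned above, the example
below contrasts a comparison against ``object`` with
``pandas.api.types.is_string_dtype``. The helper name ``is_string_column`` is
hypothetical and only used for illustration. Note that for object-dtype data,
recent pandas versions inspect the values, so this is not a drop-in
replacement in every situation; the migration guide remains the authoritative
reference.

```python
import pandas as pd


def is_string_column(ser: pd.Series) -> bool:
    """Hypothetical helper: True for object-dtype strings and for the new str dtype."""
    return bool(pd.api.types.is_string_dtype(ser))


strings = pd.Series(["a", "b"])
numbers = pd.Series([1, 2])

# Fragile: the result depends on whether string inference is enabled.
fragile_check = strings.dtype == object

# More robust: gives the same answer under the old and the new default.
robust_check = is_string_column(strings)
```

Here ``fragile_check`` changes once the string dtype becomes the default,
while ``robust_check`` is expected to stay ``True`` in both cases.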

Copy-on-Write
^^^^^^^^^^^^^

The currently optional Copy-on-Write mode will be enabled by default in pandas 3.0. There
won't be an option to retain the legacy behavior.

In summary, the new Copy-on-Write behaviour changes how pandas operates with
respect to copies and views.

1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
   i.e. including accessing a DataFrame column as a Series) or any method returning a
   new DataFrame or Series, always *behaves as if* it were a copy in terms of user
   API.
2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
   to do this is to directly modify that object itself.

Because every single indexing step now behaves as a copy, this also means that
"chained assignment" (updating a DataFrame with multiple setitem steps) will
stop working. Because this now consistently never works, the
``SettingWithCopyWarning`` will be removed.
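
The two rules above can be illustrated with a short sketch that behaves the
same with and without Copy-on-Write. The column names and values are made up
for illustration. It replaces a chained assignment such as
``df["a"][df["b"] > 4] = 100`` with a single ``.loc`` step, and takes an
explicit copy before modifying a subset.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# One setitem step on the object itself: this modifies ``df`` directly.
df.loc[df["b"] > 4, "a"] = 100

# A subset that should be modified independently: copy it explicitly,
# so the intent is the same under both the legacy and the new behavior.
subset = df[df["b"] > 4].copy()
subset["a"] = 0
```

After running this, ``df`` holds the values assigned through ``.loc``, and the
changes made to ``subset`` do not propagate back to ``df``.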

The new behavioral semantics are explained in more detail in the
:ref:`user guide about Copy-on-Write <copy_on_write>`.

The new behavior has been available since pandas 2.0 and can be enabled with the following option:

.. code-block:: python

    pd.options.mode.copy_on_write = True

Some of the behaviour changes allow a clear deprecation, like the changes in
chained assignment. Other changes are more subtle, so their warnings are
hidden behind an option that has been available since pandas 2.2:

.. code-block:: python

    pd.options.mode.copy_on_write = "warn"

This mode will warn in many different scenarios that aren't actually relevant to
most queries. We recommend exploring this mode, but it is not necessary to get rid
of all of these warnings. The :ref:`migration guide <copy_on_write.migration_guide>`
explains the upgrade process in more detail.

.. _whatsnew_230.enhancements:

Enhancements
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v2.3.1.rst
@@ -44,7 +44,7 @@ correctly, rather than defaulting to ``object`` dtype. For example:

.. code-block:: python

-    >>> pd.options.mode.infer_string = True
+    >>> pd.options.future.infer_string = True
     >>> df = pd.DataFrame()
     >>> df.columns.dtype
     dtype('int64')  # default RangeIndex for empty columns