Support for sub-interpreters #5564

b-pass · 2025-03-16T01:25:08Z

Description

This PR add the ability for pybind11 modules to support subinterpreters. This support requires 2 things:

Multiphase init (so this PR depends on feat: change PYBIND11_MODULE to use multi-phase init (PEP 489) #5574)
internals and local_internals have to have an instance per-interpreter (can no longer be static singletons), which is (now) the primary subject of this PR

Multiphase init

The PR adds mod_per_interpreter_gil and mod_multi_interpreter_one_gil tags which can be passed as the 3rd argument to the macro (in addition to the existing mod_gil_not_used). If neither is specified, the module continues to do multiphase init but reports Py_MOD_MULTIPLE_INTERPRETERS_NOT_SUPPORTED.

When a module is imported a second time in a sub-interpreter, the module's exec slot is run again. For pybind11 this means the user's the module init function is re-run in the sub-interpreter. That's good, because the sub-interpreter needs it's own type_info for all of the bindings.

internals

This presents the problem that the place that pybind11 stores these is currently a singleton. But sub-interpreters need this state to be per-interpreter. That means the proper instance of these (for the current active interpreter) needs to be retrieved from the interpreter state dict. Fortunately, the internals pointer-to-pointer is already stored in the state dict.

In order to minimize performance costs, we can detect whether or not multi-subinterpreters are present by counting how many times the module has been imported. If it has only been imported once then it can only possibly have one internals (even if there are other subinterpreters where it was not imported). When it has been imported more times, then we need to do additional work to make sure the right internals object is used (the one associated with the current interpreter in the current thread). We can switch between these two cases with a single simple branch, thus causing minimal performance overhead for existing code.

In the multi-interpreter case we would still like to minimized the cost of accessing internals, we don't want to have to reach into the interpreter state dict every time. So we cache the value in a thread_local along with the pointer to the PyThreadState to which it belongs. This means that the slow path (acquiring the GIL, doing a dict lookup, etc) is only done when the active PyThreadState changes (or the first time get_internals is called in an OS thread). So the fast path merely checks that the PyThreadState hasn't changed, and then returns the previously looked up value.

local_internals

local_internals is a slightly larger change. It was also global singleton and was not stored in the interpreter state dict at all. It has changed to be much more like the internals code, and both have been refactored a little bit to share some of that code. local_internals needs to be per-module-per-interpreter (unlike internals), so we have to formulate a unique key for the state dict. Other than the key, the two now work almost identically.

Performance

On the current version of master, detail::get_internals() takes about 1.7ns per call on my machine.
On this PR without multiple subinterpreters present, detail::get_internals() takes about 1.95ns per call on my machine. (About 15% slower)
On this PR with multiple subinterpreters present, detail::get_internals() takes about 5.13ns per call on my machine. (About 300% slower).

So multiple subinterpreters does definitely introduce a cost when the feature is used, but merely supporting the use has only a small additional cost. The 15% additional cost of this very small function is unlikely to be noticeable in any meaningful program.

Memory management / Future work

This PR does not add support for creating / deleting / switching between sub-interpreters.

In embed, pybind11 only cleans up the internals and local_interals associated with the main interpreter (when it is finalized). SInce it doesn't currently manage any subinterpreters it can't clean up after them.

Suggested changelog entry:

* Support for sub-interpreters (both with and without separate GILs). Add ``py::mod_multi_interpreter_one_gil()`` or ``py::mod_per_interpreter_gil()`` tag to PYBIND11_MODULE (third parameter) to indicate if a module supports running with sub-interpreters.

The guide needs to add a short mention of py::mod_multi_interpreter_one_gil() and py::mod_per_interpreter_gil() tags.

henryiii · 2025-03-19T19:56:05Z

include/pybind11/detail/internals.h

@@ -260,7 +260,25 @@ struct type_info {
 /// Each module locally stores a pointer to the `internals` data. The data
 /// itself is shared among modules with the same `PYBIND11_INTERNALS_ID`.
 inline internals **&get_internals_pp() {
+#if defined(PYPY_VERSION) || defined(GRAALVM_PYTHON) || PY_VERSION_HEX < 0x030C0000               \


What about Emscripten/WASI? I'm assuming iOS and Android would behave like normal CPython, but does Wasm support this?

I'm not very familiar with non-CPython implementations, but it looks like Pyodide has some support for subinterpreters at least.

henryiii · 2025-03-19T20:16:04Z

Very excited to see this! I have a couple of comments/questions:

What happens if someone doesn't define PYBIND11_SUBINTERPRETER_SUPPORT but tries to use the new tags? I would think you'd want a compile-time error?
Do you know how much of an impact PYBIND11_SUBINTERPRETER_SUPPORT has? Just curious what constitutes "small". Having a compile time opt-in is a little unfortunate, but having it as a default-off option sounds safe for now, though.
I believe communication between extensions even with different compile definitions is still supported due to @rwgk's interface work.

We'll need some docs, too. Maybe we should do a full test run with the define on, too.

rwgk

This is a complex PR. I need to find more time later for a full review.

High-level questions:

Could it be useful to split this PR: 1. multi-phase init only. 2. multi-interpreter support? — That would make it easier to do the reviews now, and understand the development steps in the future. It might also help us dealing with bugs after this change is released.
Are you still working on a new Python tests, to exercise the new multi-interpreter support?
Is there a potential for bug or feature interference between free-threading and multi-interpreter functionality? — I think we'll need tests for all combinations of (free-threading on/off) x (multi-interpreter support on/off); not for all platforms, but maybe one each: Linux, macOS, Windows.

include/pybind11/detail/common.h

include/pybind11/detail/internals.h

b-pass · 2025-03-20T01:45:29Z

* Could it be useful to split this PR: 1. multi-phase init only. 2. multi-interpreter support? — That would make it easier to do the reviews now, and understand the development steps in the future. It might also help us dealing with bugs after this change is released.

Sure. I have created #5574 for multi-phase init only. I will keep updating this PR for sub-interpreter support, and will remove the multiphase init from this branch shortly.

* Are you still working on a new Python tests, to exercise the new multi-interpreter support?

I'm not currently, but I can add a few more after/along with some additional changes from comments.

* Is there a potential for bug or feature interference between free-threading and multi-interpreter functionality? — I think we'll need tests for all combinations of (free-threading on/off) x (multi-interpreter support on/off); not for all platforms, but maybe one each: Linux, macOS, Windows.

No one would ever say less testing :) I think the potential problems are small. While they have similar goals, free threading and own-gil-sub-interpreters are fairly different and could be used together. Sub-interpreters were originally created (I think) to offer sandboxing features. So even with free threading, the idea that a module is used in two different sandboxes is still valid, and it still needs separation for each instance.

Also I considered whether free-threading is a superset of sub-interpreter support. I think it is not, for the same reason that the two have slightly different implications for a module. With free threading (only) it is perfectly reasonable to have, for example, a global/static atomic variable. With sub-interpreters that is probably incorrect, because each sub-interpreter should have its own separate state. However, if a module is free-threading safe (so, thread safe) and it is multi-interpreter aware (as no global state) then it should also be own-gil-sub-interpreter safe... that is, it doesn't need GIL synchronization across the many multiple subinterpreter states, which must be true or it would not be free-threading safe.

wjakob · 2025-03-20T04:03:02Z

My 2cts: TLS in shared libraries is real disaster (especially C++ thread_local involving nontrivial types). The following writeup is enlightening: https://yosefk.com/blog/cxx-thread-local-storage-performance.html. FWIW I saw significant perf gains in nanobind by basically removing all use of TLS except for a handful of rare/complex cases.

So adding more TLS to the internals data structure in general sounds like a pretty significant performance sink. I would encourage you to thoroughly benchmark function/method calling on Windows/macOS/Linux to see how bad this is, and to what extent these costs can be mitigated.

henryiii · 2025-03-21T16:46:21Z

FYI, this is a perfect example of where I'd personally always rebase and force push. ;)

b-pass · 2025-03-22T00:25:30Z

My 2cts: TLS in shared libraries is real disaster (especially C++ thread_local involving nontrivial types). The following writeup is enlightening: https://yosefk.com/blog/cxx-thread-local-storage-performance.html. FWIW I saw significant perf gains in nanobind by basically removing all use of TLS except for a handful of rare/complex cases.

Luckily, these are all zero-initialized pointers, so no constructors. And I think multiple interpreters in multiple threads a the same time might qualify as a complex case. Still, point taken, on this expert advice I have made a bunch of changes to get the thread_local s completely out of the path of code that isn't running multiple interpreters. It takes just one branch to do so, so that's pretty much the best we could hope for performance wise.

Unfortunately, in the multi-interpreter case there really isn't any choice but to use thread_local. The TLS is there to avoid an even more costly dict lookup in the interpreter state dict.

b-pass · 2025-03-22T14:17:55Z

I would encourage you to thoroughly benchmark function/method calling on Windows/macOS/Linux to see how bad this is, and to what extent these costs can be mitigated.

Looks like, with the rewrite, the cost is an extra 0.22ns per call to get_details (an increase of about 15%). IMO that is a very small cost, unlikely any real-world usage will notice it.

The cost is much more significant when a subinterpreter is actually created, the cost to access the internals triples. In my opinion, that's just the cost of using subinterpreters.

... I've added this information to the PR description.

include/pybind11/detail/common.h

XuehaiPan · 2025-03-23T21:00:37Z

Hi, I wonder if you have read these:

To support sub-interpreters, I think we also need to implement PEP 573 and PEP 630. To clarify:

Managing Per-Module State

A new module-level state needs to be added. All types defined by py::class_ need to be heap types and need to be managed by the module state. Users also need to implement m_{traverse,clear,free} for the state and these classes. See gh-103092: Isolate _collections python/cpython#103093 as an example.
Module State Access from Functions & Module State Access from Classes & Module State Access from Regular Methods
Changing Static Types to Heap Types

I think it would be really hard to make PYBIND11_MODULE to both support module isolation while also keeping backward compatibility. Maybe we should add a new API PYBIND11_ISOLATED_MODULE. Or Opt-Out: Limiting to One Module Object per Process.

b-pass · 2025-03-23T22:25:01Z

To support sub-interpreters, I think we also need to implement PEP 573 and PEP 630.

The goal of these is to get rid of global state, and replace it with state that is tied to the module instance. While implementing this according to the python guides will definitely accomplish that goal -- and that might be the best way to do it --, that is not the only way to accomplish it. I think strict adherence to these would require major rewrites of several parts of pybind11, but I don't think that is necessary to support sub-interpreters/multi-interpreters. Definitely following PEP573 would make a module work with sub-interpreters. It is a sufficient condition but not a necessary condition.

Pybind11's global state is entirely contained within internals and local_internals. This PR makes these into per-interpreter state which is similar to how they behave today (when there is only one interpreter). But not into per-module state (how python suggests that state be managed). The existing implementation with global static state isn't compliant with PEP 573, this PR does not get pybind11 any closer to that either. Pybind11 has some problems that PEP 573 would solve already, and this PR doesn't fix them but it doesn't introduce any new issues AFAIK.

* [Changing Static Types to Heap Types](https://docs.python.org/3.13/howto/isolating-extensions.html#changing-static-types-to-heap-types)

I don't think this is required, since Pybind11's types are managed by its internals structures, they already are not globally static in the strict sense. Converting them to use Type_Spec and Slots is IMO unrelated to this PR.

(Edit: or, maybe pybind11 is already doing heap types? At least, some of them are...)

I think it would be really hard to make PYBIND11_MODULE to both support module isolation while also keeping backward compatibility.

I agree that is probably impossible. My goal here isn't full module isolation, pybind11 already doesn't have module isolation and it can't be added. That doesn't mean it can't support sub-interpreters. Maybe another way to think about it is that this PR adds interpreter isolation without adding module isolation. The examples you linked explain the kinds of problems that non-isolated modules have, which existing pybind11 modules all have, and they would continue to have after this PR.

henryiii · 2025-03-25T03:04:39Z

I've run @wjakob's benchmarks for nanobind on this PR. The binary size might be a bit off, since I can't run strip due to needing undefined symbols, but I don't think that has any affect here anyway. I couldn't get nanobind to load, so I had to take it off the runtime plot.

This seems to have a noticeable impact on debug (unoptimized) performance, but not really noticeable on runtime, probably within the uncertainty margins. I'd love for the runtime cost to go down (there's an old PR that did that, but not usable anymore) instead of up, but this looks acceptable to avoid complications building.

henryiii · 2025-03-28T19:14:27Z

include/pybind11/detail/internals.h

@@ -18,6 +18,7 @@
 #include <pybind11/conduit/pybind11_platform_abi_id.h>
 #include <pybind11/pytypes.h>

+#include <atomic>


Note to self since I checked: we are already using atomic in object.h (using atomic requires linking to libatomic on some platforms, like armv7l). I think we are actually missing that currently (https://github.com/scikit-hep/boost-histogram/blob/38ae735c07a9bbbbc80ca5ad9b57af106f61ef43/CMakeLists.txt#L91-L93 for example), but this isn't a new issue.

…nternals.h

…self.

…rence

Can't load the native-built cpp modules.

rwgk · 2025-05-10T16:32:44Z

Here is a simple ChatGPT prompt that produced a great (I think) automatic review:

https://chatgpt.com/share/681f7e39-6d58-8008-bee8-243ff1f3830a

It's looking very good, especially also considering that all GitHub Actions pass here.

@b-pass, if you're around, could you please take a look at the ChatGPT review? Are there low-hanging fruits you'd want to incorporate here?

I'll spend a little bit of time now on the "Follow-up Learning" homework ChatGPT assigned to me.

include/pybind11/pybind11.h

rwgk · 2025-05-10T18:32:49Z

tests/CMakeLists.txt

+if(NOT PYBIND11_CUDA_TESTS)
+  # This module doesn't get mixed with other test modules because those aren't subinterpreter safe.
+  pybind11_add_module(mod_test_interpreters THIN_LTO mod_test_interpreters.cpp)
+  set_target_properties(mod_test_interpreters PROPERTIES LIBRARY_OUTPUT_DIRECTORY
+                                                         "${CMAKE_CURRENT_BINARY_DIR}")
+  foreach(config ${CMAKE_CONFIGURATION_TYPES})
+    string(TOUPPER ${config} config)
+    set_target_properties(mod_test_interpreters PROPERTIES LIBRARY_OUTPUT_DIRECTORY_${config}
+                                                           "${CMAKE_CURRENT_BINARY_DIR}")
+  endforeach()
+  add_dependencies(pytest mod_test_interpreters)
+endif()
+


@iwanders Is there a chance that you could take a quick look here? Could this be simplified?

Hey @rwgk 👋, so I had never actually used this multi generator feature, so I had to read up on it a bit and it is cool, I'm happy I know about this now 👍

I created this gist; https://gist.github.com/iwanders/6296126beaabdbf1c14e2fb3f71c9fea that can be cloned to reproduce the issue that this cmake logic is solving.

This CMake sets three configuration types (set(CMAKE_CONFIGURATION_TYPES "Debug;Release;Profile") , and I used the Ninja Multi-Config generator.

After cloning that gist make a build directory next to the repository and run:

# This first command uses cmake to generate the ninja files ready for building multiple configurations: cmake -G "Ninja Multi-Config" ../repo/ # Then we build the Debug configuration cmake --build . --config Debug # And the Release configuration. cmake --build . --config Release

The first target mod_test_interpreters_problem shows the problem, if you have a multi configuration setup, you get a subdirectory in the CMAKE_CURRENT_BINARY_DIR destination directory:

$ tree problem/ problem/ ├── Debug │ └── libmod_test_interpreters_problem.so ├── Profile └── Release └── libmod_test_interpreters_problem.so

This is likely (I haven't checked that part) undesired as I expect the unit tests assume the shared object is in a fixed location and doesn't handle the multiple configurations part.

We get them in subdirectories beause LIBRARY_OUTPUT_DIRECTORY states:

Multi-configuration generators (Visual Studio, Xcode, Ninja Multi-Config) append a per-configuration subdirectory to the specified directory unless a generator expression is used.

That is what we see, and it is likely undesired, the solution in this PR is to set the LIBRARY_OUTPUT_DIRECTORY_ property for each of the configuration types. In the gist it prints out each retrieved property, it's exactly what one expects and in the built artifacts we do see only a single library in the same directory (the release build overwrote the debug artifact):

$ tree PR_output/ PR_output/ └── libmod_test_interpreters.so

So that places the library in the expected location where the unit test can find it.

As for simplification, if we read the previous quote again we can focus on the:

unless a generator expression is used

section, which means that if we write the set_target_properties using a generator-expression this behaviour of appending the configuration shouldn't happen. I don't actually know how to make a trivial generator expression that just echoes a variable, best I could think of was making a trivial if statement that's always true:

set_target_properties(mod_test_interpreters2 PROPERTIES LIBRARY_OUTPUT_DIRECTORY "$<1:${CMAKE_CURRENT_BINARY_DIR}>/generator_output")

Which shows that we also get only one library in the expected location:

$ tree generator_output/ generator_output/ └── libmod_test_interpreters2.so

So I think this can be written without the foreach by replacing the current non-config set_target_properties with the following:

set_target_properties(mod_test_interpreters PROPERTIES LIBRARY_OUTPUT_DIRECTORY "$<1:${CMAKE_CURRENT_BINARY_DIR}>")

I'm not sure if the rest of pybind11's cmake support multi-config generators, if they do you could use this. If they don't I'm not sure if I'd introduce the complexity for just this new test. Edit; Since you need part of the changes here anyway, might as well do the generator approach and support multi-config situations 👍

rwgk

I haven't looked much at the details of the production code changes yet, but based on ChatGPT's assessment, and my prior pass, I assume the low-level details are already in good shape. My main suggestion is to consolidate the mod_per_interpreter_gil, mod_multi_interpreter_one_gil bools into one enum, and to expand the testing.

tests/test_interpreters.py

tests/mod_test_interpreters.cpp

tests/test_interpreters.py

Update tests accordingly

b-pass · 2025-05-10T23:54:07Z

@b-pass, if you're around, could you please take a look at the ChatGPT review? Are there low-hanging fruits you'd want to incorporate here?

I renamed "get_interpreter_counter" to "get_num_interpreters_seen" (sort of as suggested, pick a more descriptive name). And the tag class.

…erpreters modules

b-pass requested a review from henryiii as a code owner March 16, 2025 01:25

b-pass force-pushed the subinterpreters branch 2 times, most recently from 59a2076 to 02e9609 Compare March 16, 2025 01:31

b-pass marked this pull request as draft March 16, 2025 01:32

b-pass force-pushed the subinterpreters branch 5 times, most recently from 236cc25 to 75d55f3 Compare March 16, 2025 02:05

b-pass marked this pull request as ready for review March 16, 2025 02:55

b-pass force-pushed the subinterpreters branch 2 times, most recently from df2fcc6 to e64d19f Compare March 16, 2025 19:30

henryiii reviewed Mar 19, 2025

View reviewed changes

rwgk reviewed Mar 19, 2025

View reviewed changes

b-pass mentioned this pull request Mar 20, 2025

feat: change PYBIND11_MODULE to use multi-phase init (PEP 489) #5574

Merged

b-pass force-pushed the subinterpreters branch from e64d19f to ecf287a Compare March 20, 2025 02:15

b-pass marked this pull request as draft March 20, 2025 22:57

b-pass force-pushed the subinterpreters branch from a6385de to 91316e8 Compare March 21, 2025 23:59

b-pass marked this pull request as ready for review March 22, 2025 14:17

rwgk reviewed Mar 22, 2025

View reviewed changes

include/pybind11/detail/common.h Show resolved Hide resolved

henryiii reviewed Mar 28, 2025

View reviewed changes

b-pass added 9 commits May 8, 2025 23:28

Fix initializer to make MSVC happy.

08c7239

Switch to gil_scoped_acquire_simple, get rid of old copy of it from i…

932c9b2

…nternals.h

Use the PyThreadState's interp member rather than the thread state it…

f728985

…self.

Be more explicit about the type of the internalspp

918bb87

Suggested renamings and rewordings

605179d

Rename find_internals_pp and change it to take in the state dict refe…

945e886

…rence

Use the old raise_from instead of pybind11_fail

628d40e

Move most of the internals initialization into its constructor.

76564dd

Move round_up_to_next_pow2 function upwards

78a1097

b-pass force-pushed the subinterpreters branch from 2402b31 to 385f33e Compare May 9, 2025 03:29

b-pass added 2 commits May 9, 2025 18:16

Remove redundant forward decl

5f5023e

Add a python-driven subinterpreter test

c3e51ff

b-pass force-pushed the subinterpreters branch from a323fab to c3e51ff Compare May 9, 2025 22:37

b-pass added 4 commits May 9, 2025 19:01

Disable the python subinterpreter test on emscripten

72ec43c

Can't load the native-built cpp modules.

Switch the internals pointer pointer to a unique_ptr pointer

3390554

Spelling

c3dd9ac

Fix clang-tidy warning, compare pointer to nullptr

8bcab79

rwgk reviewed May 10, 2025

View reviewed changes

include/pybind11/pybind11.h Outdated Show resolved Hide resolved

rwgk reviewed May 10, 2025

View reviewed changes

b-pass added 4 commits May 10, 2025 18:22

Rename get_interpreter_counter to get_num_interpreters_seen

1e48618

Try simplifying the test's cmake set_target_properties

1336929

Replace mod_* tags with a single tag w/enum

d7420ca

Update tests accordingly

Add a test for shared-GIL (legacy) subinterpreters

eec1cae

b-pass force-pushed the subinterpreters branch from 9831611 to eec1cae Compare May 10, 2025 23:42

b-pass added 2 commits May 10, 2025 20:03

Update test to work around differences in the various versions of int…

d346aa0

…erpreters modules

Fix unused parameter

92897fe

b-pass force-pushed the subinterpreters branch from a856d97 to 92897fe Compare May 11, 2025 00:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for sub-interpreters #5564

Support for sub-interpreters #5564

b-pass commented Mar 16, 2025 •

edited

Loading

henryiii Mar 19, 2025

b-pass Mar 22, 2025

henryiii commented Mar 19, 2025

rwgk left a comment

b-pass commented Mar 20, 2025

wjakob commented Mar 20, 2025

henryiii commented Mar 21, 2025

b-pass commented Mar 22, 2025

b-pass commented Mar 22, 2025

XuehaiPan commented Mar 23, 2025 •

edited

Loading

b-pass commented Mar 23, 2025 •

edited

Loading

henryiii commented Mar 25, 2025

henryiii Mar 28, 2025

rwgk commented May 10, 2025

rwgk May 10, 2025

iwanders May 10, 2025 •

edited

Loading

rwgk left a comment

b-pass commented May 10, 2025

Support for sub-interpreters #5564

Are you sure you want to change the base?

Support for sub-interpreters #5564

Conversation

b-pass commented Mar 16, 2025 • edited Loading

Description

Multiphase init

internals

local_internals

Performance

Memory management / Future work

Suggested changelog entry:

henryiii Mar 19, 2025

Choose a reason for hiding this comment

b-pass Mar 22, 2025

Choose a reason for hiding this comment

henryiii commented Mar 19, 2025

rwgk left a comment

Choose a reason for hiding this comment

b-pass commented Mar 20, 2025

wjakob commented Mar 20, 2025

henryiii commented Mar 21, 2025

b-pass commented Mar 22, 2025

b-pass commented Mar 22, 2025

XuehaiPan commented Mar 23, 2025 • edited Loading

b-pass commented Mar 23, 2025 • edited Loading

henryiii commented Mar 25, 2025

henryiii Mar 28, 2025

Choose a reason for hiding this comment

rwgk commented May 10, 2025

rwgk May 10, 2025

Choose a reason for hiding this comment

iwanders May 10, 2025 • edited Loading

Choose a reason for hiding this comment

rwgk left a comment

Choose a reason for hiding this comment

b-pass commented May 10, 2025

b-pass commented Mar 16, 2025 •

edited

Loading

XuehaiPan commented Mar 23, 2025 •

edited

Loading

b-pass commented Mar 23, 2025 •

edited

Loading

iwanders May 10, 2025 •

edited

Loading