Interaction of `DLManagedTensor::deleter` and the GIL #103

wjakob · 2022-04-04T14:44:37Z

Dear DLPack authors,

DLPack is widely used to enable interoperability between large C++-based frameworks that furthermore provide Python bindings. The C++ parts of such a framework are often usable through an independent C++ API (e.g. in the case of PyTorch, Tensorflow, ..), where functions can be called from multi-threaded code. In contrast, multithreading in Python is much more restricted: any use of the CPython API requires that the Global Interpreter Lock (GIL) is held.

This discrepancy has implications on projects mixing C++ and Python code. Function in such a mixed C++/Python codebase should clarify require whether the caller should ensure that the GIL is held.

This is currently unclear in the case of DLManagedTensor::deleter, which could be called by the PyCapsule destructor (where the GIL is held) or from arbitrary C++ code at some later point following "consumption" of the capsule (where the GIL is not necessarily held -- this destructor call could even occur from a different thread!)

I don't have any strong opinions either way, but ideally the documentation of the interfaces should say so.

Thanks,
Wenzel

The text was updated successfully, but these errors were encountered:

tqchen · 2022-04-04T14:55:24Z

Thanks @wjakob ! Indeed this is worth clarifying.

As far as I know, GIL is not necessary if the deleter is defined in C API. There can be cases where the deleter itself is defined by python's FFI mechanism, in such case, the GIL holding could happen inside the deleter, but nevertheless not the job of the caller of the deleter

rgommers · 2022-04-05T08:27:14Z

The C docs clearly should not know anything about the GIL. As for the Python docs, this section talks about transfer of ownership of DLManangedTensor to the consumer: https://dmlc.github.io/dlpack/latest/python_spec.html#implementation. The dlpack_capsule_deleter directly uses the Python C API, so clearly already holds the GIL. The deleter can only be called once within dlpack_capsule_deleter (there's an early exit at the top). So it seems pretty clear at the moment.

@wjakob what exact comment/phrasing would help you here?

wjakob · 2022-04-05T08:36:16Z

How about the following?

When the capsule has been consumed, the consumer library is responsible for calling DLManagedTensor::deleter. From the viewpoint of the producer, no guarantees are given with respect to the execution context of this subsequent function call. For example, the call could take place on a different thread, and the Python GIL may or may not be held. It is the responsibility of the deleter implementation to handle such conditions using synchronization primitives, if needed.

rgommers · 2022-04-05T08:45:12Z

It is the responsibility of the deleter implementation to handle such conditions using synchronization primitives, if needed.

Thanks, looks reasonable - except "deleter implementation" here seems wrong (it's the producer that implements deleter; you want to say that it's the caller of deleter's responsibility.

wjakob · 2022-04-05T20:34:00Z

Really? Maybe I am misunderstanding something. My mental picture is:

Mostly Python-based Framework X wraps a tensor in a DLPack capsule ("producer").
Mostly C++-based framework Y consumes tensor, later releases it from another thread.

Suppose Framework X needs to run a few python operations to clean up tensor that is now unreferenced. It needs to acquire the GIL to invoke those Python API calls safely. However, the part of Framework Y that released the consumed tensor has no idea about Python or the GIL.

So it's the responsibility of the implementation of the deleter (provided by framework X) to first acquire the GIL before subsequent steps.

tqchen · 2022-04-05T20:58:24Z

@wjakob I think your interpretation is correct. However, most of the frameworks implement the allocation of the object themselves in C/C++/cython, then expose to python, in which case the deleter donot need to acquire GIL.

leofang · 2022-04-05T21:11:49Z

It is easier for the deleter caller to hold the GIL, because when the caller retrieves the pointer from the capsule using PyCapsule_GetPointer it already needs to see a valid Python context. (We didn't require copying a DLManagedTensor when the consumer first receives the capsule, so it is safer that we only consider accessing the deleter from a capsule.)

wjakob · 2022-04-06T08:48:39Z

@leofang I think part of the confusion is that there are effectively two "deleters", and perhaps it would be good to come up with some kind of terminology in the documentation to avoid ambiguity.

The first is the capsule deleter. It is invoked when the capsule expires, and it may or may not call DLManagedTensor::deleter depending on whether the capsule is marked consumed. The GIL is always held in this case.
The second is the DLPack tensor deleter (DLManagedTensor::deleter field). It may be called by an expired capsule (in which case the GIL is held), but alternatively it could also be called by an external framework that is now using the tensor. This call could originate from another thread and in a context where it may not be straightforward to acquire the GIL (e.g. when called by the C++ portion of another framework that doesn't even link to libpython).

So my suggestion would be to add a sentence to the documentation saying that DLManagedTensor::deleter may be called from an arbitrary context (where "context" refers to caller thread, and status of locks like the GIL). This means that additional checks and synchronization may be needed to safely issue Python C API calls.

Hypothetical example: what if manager_ctx is actually pointer to a PyObject * that backs the tensor storage. In that case, we need to issue a reference counting operation that acquires the GIL before Py_DECREF can be safely executed. There is no harm in doing this if the GIL is already held, but it avoids undefined behavior when my_deleter is called on another thread by a C++-based framework.

void my_deleter(struct DLManagedTensor * self) {
    PyGILState_STATE state = PyGILState_Ensure();
    Py_DECREF((PyObject *) (self->manager_ctx));
    PyGILState_Release(state);
}

rgommers · 2022-04-08T19:15:04Z

I think part of the confusion is that there are effectively two "deleters", [...]

I'm not sure it's that confusing; at least to me "the deleter" is always DLManagedTensor::deleter. It is the only thing that is being actively called.

So my suggestion would be to add a sentence to the documentation saying that DLManagedTensor::deleter may be called from an arbitrary context (where "context" refers to caller thread, and status of locks like the GIL). This means that additional checks and synchronization may be needed to safely issue Python C API calls.

This sentence makes sense to me. It probably needs some context that the remark is intended for implementers of a Python library, who may not be aware that there are non-Python DLPack users. Otherwise it's more confusing than helpful for C/C++ users, who will wonder why on earth they need to bother with a lock from a language they're not using.

seberg · 2022-04-11T21:08:03Z

Hypothetical example: what if manager_ctx is actually pointer to a PyObject * that backs the tensor storage.

There is nothing hypothetical about this. This is the current way NumPy, cupy, and probably half of the other Python providers work. So this is a very important clarification that many implementations will violate as soon as it is formalized.
(They probably mostly just use decref, so it is very unlikely for problems to occur in practice, but it is a clear violation.)

wjakob · 2022-04-20T11:37:57Z

Py_DECREF can trigger arbitrary python code to run, and it also uses a non-atomic decrement instruction that may race with other reference count changes done in an independently running Python thread. You really do need to hold the GIL when issuing Python API calls.

My suggestion would be that the my_deleter example posted above is included in the documentation to clarify this potentially dangerous situation.

seberg · 2022-04-20T15:11:56Z

Yes, it seems that it should probably be written that "the deleter MUST NOT assume that the GIL is held when it is called" (for Python but applies to any other environment).
The caveat that should maybe be noted: That this is not necessarily true for current implementations. So users should ideally only rely on that for the "next" version of __dlpack__.

pitrou · 2023-10-26T15:28:27Z

Yes, it seems that it should probably be written that "the deleter MUST NOT assume that the GIL is held when it is called" (for Python but applies to any other environment).

Noisy +1 to this. This would make things clearer for exporters of non-Python data (such as PyArrow).

leofang · 2024-03-22T00:00:08Z

@tqchen @seberg Do you think we should update the docs before tagging DLPack 1.0?

tqchen · 2024-03-22T02:34:10Z

feel free to send a PR

I honestly don't remember if that ``Py_IsInitialized`` wasn't just to tape over C++ related deficiencies, but I suspect there isn't much to be avoided there... (I.e. if we take care of the GIL for the other library, we also have to do this check for them which would normally be their job.) Closes dmlcgh-103

seberg · 2024-03-22T08:15:36Z

gh-140

* DOC: Document that GIL must be grabbed (and Py_IsInitialized()) I honestly don't remember if that ``Py_IsInitialized`` wasn't just to tape over C++ related deficiencies, but I suspect there isn't much to be avoided there... (I.e. if we take care of the GIL for the other library, we also have to do this check for them which would normally be their job.) Closes gh-103 * Apply suggestions from code review Co-authored-by: Antoine Pitrou <[email protected]> * Update docs/source/python_spec.rst Co-authored-by: Antoine Pitrou <[email protected]> * Add note that this of course only applies if it uses Python objects/API --------- Co-authored-by: Antoine Pitrou <[email protected]>

This was referenced Nov 1, 2022

BUG: array_dlpack_deleter does not take the GIL before calling into CPython API numpy/numpy#22507

Closed

MAINT: Ensure raw dlpack deleter works when called without the GIL numpy/numpy#22510

Merged

leofang mentioned this issue Oct 26, 2023

Unclear if tensor deleter requires the caller to take the GIL #129

Closed

seberg mentioned this issue Mar 22, 2024

DOC: Document that GIL must be grabbed (and Py_IsInitialized()) #140

Merged

tqchen closed this as completed in #140 Mar 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interaction of `DLManagedTensor::deleter` and the GIL #103

Interaction of `DLManagedTensor::deleter` and the GIL #103

wjakob commented Apr 4, 2022

tqchen commented Apr 4, 2022

rgommers commented Apr 5, 2022

wjakob commented Apr 5, 2022 •

edited

Loading

rgommers commented Apr 5, 2022

wjakob commented Apr 5, 2022

tqchen commented Apr 5, 2022 •

edited

Loading

leofang commented Apr 5, 2022

wjakob commented Apr 6, 2022

rgommers commented Apr 8, 2022

seberg commented Apr 11, 2022

wjakob commented Apr 20, 2022

seberg commented Apr 20, 2022

pitrou commented Oct 26, 2023

leofang commented Mar 22, 2024

tqchen commented Mar 22, 2024

seberg commented Mar 22, 2024

Interaction of DLManagedTensor::deleter and the GIL #103

Interaction of DLManagedTensor::deleter and the GIL #103

Comments

wjakob commented Apr 4, 2022

tqchen commented Apr 4, 2022

rgommers commented Apr 5, 2022

wjakob commented Apr 5, 2022 • edited Loading

rgommers commented Apr 5, 2022

wjakob commented Apr 5, 2022

tqchen commented Apr 5, 2022 • edited Loading

leofang commented Apr 5, 2022

wjakob commented Apr 6, 2022

rgommers commented Apr 8, 2022

seberg commented Apr 11, 2022

wjakob commented Apr 20, 2022

seberg commented Apr 20, 2022

pitrou commented Oct 26, 2023

leofang commented Mar 22, 2024

tqchen commented Mar 22, 2024

seberg commented Mar 22, 2024

Interaction of `DLManagedTensor::deleter` and the GIL #103

Interaction of `DLManagedTensor::deleter` and the GIL #103

wjakob commented Apr 5, 2022 •

edited

Loading

tqchen commented Apr 5, 2022 •

edited

Loading