Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interaction of DLManagedTensor::deleter and the GIL #103

Closed
wjakob opened this issue Apr 4, 2022 · 16 comments · Fixed by #140
Closed

Interaction of DLManagedTensor::deleter and the GIL #103

wjakob opened this issue Apr 4, 2022 · 16 comments · Fixed by #140

Comments

@wjakob
Copy link

wjakob commented Apr 4, 2022

Dear DLPack authors,

DLPack is widely used to enable interoperability between large C++-based frameworks that furthermore provide Python bindings. The C++ parts of such a framework are often usable through an independent C++ API (e.g. in the case of PyTorch, Tensorflow, ..), where functions can be called from multi-threaded code. In contrast, multithreading in Python is much more restricted: any use of the CPython API requires that the Global Interpreter Lock (GIL) is held.

This discrepancy has implications on projects mixing C++ and Python code. Function in such a mixed C++/Python codebase should clarify require whether the caller should ensure that the GIL is held.

This is currently unclear in the case of DLManagedTensor::deleter, which could be called by the PyCapsule destructor (where the GIL is held) or from arbitrary C++ code at some later point following "consumption" of the capsule (where the GIL is not necessarily held -- this destructor call could even occur from a different thread!)

I don't have any strong opinions either way, but ideally the documentation of the interfaces should say so.

Thanks,
Wenzel

@tqchen
Copy link
Member

tqchen commented Apr 4, 2022

Thanks @wjakob ! Indeed this is worth clarifying.

As far as I know, GIL is not necessary if the deleter is defined in C API. There can be cases where the deleter itself is defined by python's FFI mechanism, in such case, the GIL holding could happen inside the deleter, but nevertheless not the job of the caller of the deleter

@rgommers
Copy link
Contributor

rgommers commented Apr 5, 2022

The C docs clearly should not know anything about the GIL. As for the Python docs, this section talks about transfer of ownership of DLManangedTensor to the consumer: https://dmlc.github.io/dlpack/latest/python_spec.html#implementation. The dlpack_capsule_deleter directly uses the Python C API, so clearly already holds the GIL. The deleter can only be called once within dlpack_capsule_deleter (there's an early exit at the top). So it seems pretty clear at the moment.

@wjakob what exact comment/phrasing would help you here?

@wjakob
Copy link
Author

wjakob commented Apr 5, 2022

How about the following?

When the capsule has been consumed, the consumer library is responsible for calling DLManagedTensor::deleter. From the viewpoint of the producer, no guarantees are given with respect to the execution context of this subsequent function call. For example, the call could take place on a different thread, and the Python GIL may or may not be held. It is the responsibility of the deleter implementation to handle such conditions using synchronization primitives, if needed.

@rgommers
Copy link
Contributor

rgommers commented Apr 5, 2022

It is the responsibility of the deleter implementation to handle such conditions using synchronization primitives, if needed.

Thanks, looks reasonable - except "deleter implementation" here seems wrong (it's the producer that implements deleter; you want to say that it's the caller of deleter's responsibility.

@wjakob
Copy link
Author

wjakob commented Apr 5, 2022

Really? Maybe I am misunderstanding something. My mental picture is:

  • Mostly Python-based Framework X wraps a tensor in a DLPack capsule ("producer").
  • Mostly C++-based framework Y consumes tensor, later releases it from another thread.

Suppose Framework X needs to run a few python operations to clean up tensor that is now unreferenced. It needs to acquire the GIL to invoke those Python API calls safely. However, the part of Framework Y that released the consumed tensor has no idea about Python or the GIL.

So it's the responsibility of the implementation of the deleter (provided by framework X) to first acquire the GIL before subsequent steps.

@tqchen
Copy link
Member

tqchen commented Apr 5, 2022

@wjakob I think your interpretation is correct. However, most of the frameworks implement the allocation of the object themselves in C/C++/cython, then expose to python, in which case the deleter donot need to acquire GIL.

@leofang
Copy link
Contributor

leofang commented Apr 5, 2022

It is easier for the deleter caller to hold the GIL, because when the caller retrieves the pointer from the capsule using PyCapsule_GetPointer it already needs to see a valid Python context. (We didn't require copying a DLManagedTensor when the consumer first receives the capsule, so it is safer that we only consider accessing the deleter from a capsule.)

@wjakob
Copy link
Author

wjakob commented Apr 6, 2022

@leofang I think part of the confusion is that there are effectively two "deleters", and perhaps it would be good to come up with some kind of terminology in the documentation to avoid ambiguity.

  1. The first is the capsule deleter. It is invoked when the capsule expires, and it may or may not call DLManagedTensor::deleter depending on whether the capsule is marked consumed. The GIL is always held in this case.

  2. The second is the DLPack tensor deleter (DLManagedTensor::deleter field). It may be called by an expired capsule (in which case the GIL is held), but alternatively it could also be called by an external framework that is now using the tensor. This call could originate from another thread and in a context where it may not be straightforward to acquire the GIL (e.g. when called by the C++ portion of another framework that doesn't even link to libpython).

So my suggestion would be to add a sentence to the documentation saying that DLManagedTensor::deleter may be called from an arbitrary context (where "context" refers to caller thread, and status of locks like the GIL). This means that additional checks and synchronization may be needed to safely issue Python C API calls.

Hypothetical example: what if manager_ctx is actually pointer to a PyObject * that backs the tensor storage. In that case, we need to issue a reference counting operation that acquires the GIL before Py_DECREF can be safely executed. There is no harm in doing this if the GIL is already held, but it avoids undefined behavior when my_deleter is called on another thread by a C++-based framework.

void my_deleter(struct DLManagedTensor * self) {
    PyGILState_STATE state = PyGILState_Ensure();
    Py_DECREF((PyObject *) (self->manager_ctx));
    PyGILState_Release(state);
}

@rgommers
Copy link
Contributor

rgommers commented Apr 8, 2022

I think part of the confusion is that there are effectively two "deleters", [...]

I'm not sure it's that confusing; at least to me "the deleter" is always DLManagedTensor::deleter. It is the only thing that is being actively called.

So my suggestion would be to add a sentence to the documentation saying that DLManagedTensor::deleter may be called from an arbitrary context (where "context" refers to caller thread, and status of locks like the GIL). This means that additional checks and synchronization may be needed to safely issue Python C API calls.

This sentence makes sense to me. It probably needs some context that the remark is intended for implementers of a Python library, who may not be aware that there are non-Python DLPack users. Otherwise it's more confusing than helpful for C/C++ users, who will wonder why on earth they need to bother with a lock from a language they're not using.

@seberg
Copy link
Contributor

seberg commented Apr 11, 2022

Hypothetical example: what if manager_ctx is actually pointer to a PyObject * that backs the tensor storage.

There is nothing hypothetical about this. This is the current way NumPy, cupy, and probably half of the other Python providers work. So this is a very important clarification that many implementations will violate as soon as it is formalized.
(They probably mostly just use decref, so it is very unlikely for problems to occur in practice, but it is a clear violation.)

@wjakob
Copy link
Author

wjakob commented Apr 20, 2022

Py_DECREF can trigger arbitrary python code to run, and it also uses a non-atomic decrement instruction that may race with other reference count changes done in an independently running Python thread. You really do need to hold the GIL when issuing Python API calls.

My suggestion would be that the my_deleter example posted above is included in the documentation to clarify this potentially dangerous situation.

@seberg
Copy link
Contributor

seberg commented Apr 20, 2022

Yes, it seems that it should probably be written that "the deleter MUST NOT assume that the GIL is held when it is called" (for Python but applies to any other environment).
The caveat that should maybe be noted: That this is not necessarily true for current implementations. So users should ideally only rely on that for the "next" version of __dlpack__.

@pitrou
Copy link
Contributor

pitrou commented Oct 26, 2023

Yes, it seems that it should probably be written that "the deleter MUST NOT assume that the GIL is held when it is called" (for Python but applies to any other environment).

Noisy +1 to this. This would make things clearer for exporters of non-Python data (such as PyArrow).

@leofang
Copy link
Contributor

leofang commented Mar 22, 2024

@tqchen @seberg Do you think we should update the docs before tagging DLPack 1.0?

@tqchen
Copy link
Member

tqchen commented Mar 22, 2024

feel free to send a PR

seberg added a commit to seberg/dlpack that referenced this issue Mar 22, 2024
I honestly don't remember if that ``Py_IsInitialized`` wasn't just
to tape over C++ related deficiencies, but I suspect there isn't
much to be avoided there...

(I.e. if we take care of the GIL for the other library, we also
have to do this check for them which would normally be their job.)

Closes dmlcgh-103
@seberg
Copy link
Contributor

seberg commented Mar 22, 2024

gh-140

tqchen pushed a commit that referenced this issue Mar 22, 2024
* DOC: Document that GIL must be grabbed (and Py_IsInitialized())

I honestly don't remember if that ``Py_IsInitialized`` wasn't just
to tape over C++ related deficiencies, but I suspect there isn't
much to be avoided there...

(I.e. if we take care of the GIL for the other library, we also
have to do this check for them which would normally be their job.)

Closes gh-103

* Apply suggestions from code review

Co-authored-by: Antoine Pitrou <[email protected]>

* Update docs/source/python_spec.rst

Co-authored-by: Antoine Pitrou <[email protected]>

* Add note that this of course only applies if it uses Python objects/API

---------

Co-authored-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants