Handle cuda.core.Stream in driver operations
#401
Conversation
/ok to test
I see this PR closes #151. The issue suggests that we can pass a cuda.core stream object via the kernel launch interface, but this PR is missing a test for that use case.
| acceptable stream objects. Acceptable types are
| int (0 for default stream), Stream, ExperimentalStream
Is the docstring outdated? int is currently not allowed
Only for the special value 0 I believe.
Should we consider deprecating the ability to pass 0 as a Stream? The "default stream" is ambiguous in Python, since PTDS is normally a host compile-time concept. We have an environment variable for controlling it in cuda.bindings / cuda.core, CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM, which I think should generally be used.
It would be great if we could introduce some form of deprecation warning for passing 0 as a Stream in user-facing APIs.
From the user perspective we're deprecating these APIs fully in #546, so they should be gone entirely. But we should do a sweep and make sure we're being explicit with all our internal usages of streams.
Outside of the DeviceNDArray class, I think streams are also accepted when launching kernels and in the Event APIs; we should handle them properly there as well.
Launching is tested as part of this PR; events were added in 7df62ce, though.
| """
| Memset on the device.
| If stream is 0, the call is synchronous.
| If stream is a Stream object, asynchronous mode is used.
There is a bug (or change of behavior) here and elsewhere. stream can be a Stream object from either numba-cuda or cuda.core that still holds 0 (the default stream) under the hood; however, the call now becomes asynchronous (with respect to the host) instead of synchronous. Just wanted to call it out in case it was not the intention.
This is a really good catch. As a follow-up, is the output here as expected, where dev is a cuda.core.experimental.Device on which set_current() has been called? Should it not be (0, 0)?
>>> dev.default_stream.__cuda_stream__()
(0, 1)
I ask hoping there's a reliable way of detecting this situation based on the passed object.
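One possible detection sketch, assuming `__cuda_stream__` is a zero-argument method returning a `(version, handle)` tuple (as the snippet above suggests) and that handles 0, 1 (CU_STREAM_LEGACY), and 2 (CU_STREAM_PER_THREAD) all denote default streams — treat those handle values as assumptions here, not a confirmed contract:

```python
# Assumed set of default-stream sentinel handles: 0, plus the
# CU_STREAM_LEGACY (1) and CU_STREAM_PER_THREAD (2) special values.
_DEFAULT_STREAM_HANDLES = frozenset({0, 1, 2})


def is_default_stream(obj) -> bool:
    """Hypothetical check for whether `obj` denotes a default stream."""
    if isinstance(obj, int):
        return obj in _DEFAULT_STREAM_HANDLES
    protocol = getattr(obj, "__cuda_stream__", None)
    if callable(protocol):
        _version, handle = protocol()
        return int(handle) in _DEFAULT_STREAM_HANDLES
    return False
```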
After searching around the codebase for a while, I concluded this was at least the original intention, though these are really only used by the deprecated device array API:
If a CUDA ``stream`` is given, then the transfer will be made
asynchronously as part of the given stream. Otherwise, the transfer is
synchronous: the function returns after the copy is finished.
So AFAICT this PR maintains the above behavior, just with a new stream object. Ultimately, though, I'm not sure we should spend too much time thinking about it: these APIs will be removed, and users performing these kinds of memory transfers should use either CuPy for a nice array API or cuda.bindings for full control over things like synchronization behavior.
| fn(*args)
|
| def device_to_host(dst, src, size, stream=0):
As mentioned below (or above), the stream semantics are changed, which probably has a bigger impact on this method: the copy is now asynchronous, so a stream synchronization is needed before src can be accessed on the host.
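A minimal sketch of the required ordering; `copy_fn` and `sync_fn` are hypothetical stand-ins for the driver copy and stream-synchronize calls, not the PR's actual functions, and 0 stands for the synchronous default-stream path as in the docstring above:

```python
def device_to_host_checked(copy_fn, sync_fn, dst, src, size, stream=0):
    # Hypothetical wrapper: when a Stream object is passed the copy is
    # asynchronous, so the stream must be synchronized before the host
    # reads the destination buffer. With stream == 0 the copy is taken
    # to be synchronous and no extra synchronization is issued.
    copy_fn(dst, src, size, stream)
    if stream != 0:
        sync_fn(stream)  # ensure the copy has completed host-side
    return dst
```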
Closes #151