Conversation

@brandon-b-miller
Contributor

Closes #151

@copy-pr-bot (bot) commented Aug 18, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@brandon-b-miller
Contributor Author

/ok to test


Contributor

@isVoid left a comment

I see this PR closes #151. The issue suggests that we can pass a cuda.core stream object via the kernel launch interface, but this PR is missing a test for that use case.
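For readers outside the PR: the `__cuda_stream__` protocol referenced in #151 has a stream-like object return a `(version, handle)` tuple when the method is called. A CPU-only sketch of how a launch path might accept such objects (both `ForeignStream` and `extract_stream_handle` are hypothetical illustrations, not numba-cuda API):

```python
class ForeignStream:
    """Stand-in for a cuda.core stream: exposes the __cuda_stream__ protocol."""

    def __init__(self, handle):
        self._handle = handle

    def __cuda_stream__(self):
        # The protocol returns a (version, handle) tuple; version 0 is current.
        return (0, self._handle)


def extract_stream_handle(obj):
    """Hypothetical helper: accept 0 or any object implementing __cuda_stream__."""
    if isinstance(obj, int):
        if obj != 0:
            raise ValueError("only 0 (the default stream) is accepted as an int")
        return 0
    if hasattr(obj, "__cuda_stream__"):
        version, handle = obj.__cuda_stream__()
        if version != 0:
            raise ValueError(f"unsupported __cuda_stream__ version: {version}")
        return handle
    raise TypeError(f"{obj!r} does not expose a CUDA stream")
```

A real test would launch a kernel on a GPU with a `cuda.core` stream; the sketch above only exercises the handle-extraction side that such a test would rely on.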

Comment on lines +3538 to +3539
acceptable stream objects. Acceptable types are
int (0 for default stream), Stream, ExperimentalStream
Member

Is the docstring outdated? `int` is currently not allowed.

Contributor Author

Only for the special value 0, I believe.

Contributor

Should we consider deprecating allowing passing 0 as a Stream? The "default stream" is ambiguous in Python, since PTDS is normally a host compile-time concept. We have an environment variable for controlling it in cuda.bindings / cuda.core, CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM, which I think should generally be used.

It would be great if we could introduce a deprecation warning in some form for passing 0 as a Stream in user-facing APIs.
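A minimal sketch of what such a soft-deprecation path might look like (the `normalize_stream` name and placement are hypothetical, not part of numba-cuda):

```python
import warnings


def normalize_stream(stream):
    """Hypothetical sketch: warn when the literal 0 is passed as a stream."""
    if isinstance(stream, int) and stream == 0:
        warnings.warn(
            "Passing 0 as a stream is deprecated; pass an explicit Stream "
            "object instead (default-stream behavior is controlled by "
            "CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM).",
            DeprecationWarning,
            stacklevel=2,
        )
    return stream
```

Calling `normalize_stream(0)` at each user-facing entry point would surface the warning without changing behavior during the deprecation window.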

Contributor Author

@brandon-b-miller Oct 27, 2025

From the user perspective we're deprecating these APIs fully in #546, so they should be gone entirely. But we should do a sweep and make sure we're being explicit with all of our internal usages of streams.

Contributor

Outside of the DeviceNDArray class, I think streams are also accepted when launching kernels and in the Event APIs; should we handle them properly there as well?

Contributor Author

Launching is tested as part of this PR; events were added in 7df62ce.

"""
Memset on the device.
If stream is 0, the call is synchronous.
If stream is a Stream object, asynchronous mode is used.
Member

There is a bug (or change of behavior) here and elsewhere. stream can be a Stream object from either numba-cuda or cuda.core, but still hold 0 (the default stream) under the hood. However, the call now becomes asynchronous (with respect to the host) instead of synchronous. Just wanted to call it out in case it was not the intention.

Contributor Author

This is a really good catch. As a follow-up to this: is the output here as expected, where dev is a cuda.core.experimental.Device for which set_current() has been called? Should it not be (0, 0)?

>>> dev.default_stream.__cuda_stream__()
(0, 1)

I ask hoping there's a reliable way of detecting this situation based on the passed object.

Contributor Author

After searching around the codebase for a while, I concluded this was at least the original intention, though these are really only used for the deprecated device array API:

        If a CUDA ``stream`` is given, then the transfer will be made
        asynchronously as part of the given stream.  Otherwise, the transfer is
        synchronous: the function returns after the copy is finished.

So AFAICT this PR maintains the above behavior, just with a new stream object. Ultimately, though, I'm not sure we should spend too much time thinking about it, as these will be removed; users performing these types of memory transfers should use either CuPy for a nice array API or cuda.bindings for full control over things like synchronization behavior.

fn(*args)


def device_to_host(dst, src, size, stream=0):
Member

As mentioned below (or above), the stream semantics have changed, which probably has a bigger impact on this method: the copy is now asynchronous, so a stream synchronization is needed before accessing src on the host.
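The hazard can be illustrated without a GPU. The sketch below is a CPU-only mock of the semantics under discussion, not numba-cuda's real code: `MockStream` stands in for a CUDA stream whose enqueued work only completes at synchronization, so a host read of `dst` before `synchronize()` sees stale data.

```python
class MockStream:
    """CPU-only mock: pending work runs only when the stream is synchronized."""

    def __init__(self):
        self._pending = []

    def enqueue(self, work):
        self._pending.append(work)

    def synchronize(self):
        for work in self._pending:
            work()
        self._pending.clear()


def device_to_host(dst, src, size, stream=0):
    """Sketch of the semantics in question: stream == 0 copies synchronously;
    a stream object enqueues the copy, so the caller must synchronize."""

    def copy():
        dst[:size] = src[:size]

    if isinstance(stream, int) and stream == 0:
        copy()                  # synchronous: dst is valid on return
    else:
        stream.enqueue(copy)    # asynchronous: dst is stale until synchronize()
```

With `s = MockStream()`, calling `device_to_host(dst, src, n, stream=s)` leaves `dst` unchanged until `s.synchronize()` runs, which is the synchronization requirement the comment above is pointing out.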

@brandon-b-miller added the "3 - Ready for Review" label Oct 13, 2025
@brandon-b-miller merged commit 39066c7 into NVIDIA:main Oct 27, 2025
70 checks passed
@brandon-b-miller deleted the cuda-core-streams branch October 27, 2025 22:01

Development

Successfully merging this pull request may close these issues.

[FEA] Make cuda.core.Stream recognized by numba-cuda by supporting the __cuda_stream__ protocol

5 participants