Add dem_sampling CPU/GPU across C++ and Python #479
Open
kvmto wants to merge 17 commits into NVIDIA:main from
Introduce dem_sampling implementations for CPU and cuStabilizer-backed GPU paths in C++, and expose them through pybind/Python with torch tensor and device-pointer support. Add C++/Python coverage for backend paths and wire build/packaging checks so cuStabilizer requirements are enforced for shipping.

Signed-off-by: kvmto <kmato@nvidia.com>
bmhowe23 reviewed Apr 1, 2026
libs/qec/pyproject.toml.cu12 (Outdated)
      "tensorrt-cu12"
    ]
    dem_sampling = [
      "torch",
Collaborator
Is there a way to do DEM sampling without Torch?
Collaborator
Author
Yes, with NumPy, which is the most likely use case for the CPU implementation.
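To make the NumPy-only case concrete, here is a minimal sketch of what DEM sampling computes: each error mechanism fires independently with its probability, and each syndrome bit is the parity of the fired mechanisms that touch that check. The helper name `sample_dem`, its parameters, and its return order are hypothetical and chosen for illustration; this is not the `cudaq_qec.dem_sampling` signature, which is not shown in this thread.

```python
import numpy as np


def sample_dem(check_matrix, error_probs, num_shots, seed=None):
    """Hypothetical helper (not the cudaq_qec API): sample errors and syndromes.

    check_matrix: (num_checks, num_mechanisms) array of 0/1 entries.
    error_probs:  (num_mechanisms,) array of independent firing probabilities.
    Returns (syndromes, errors) as uint8 arrays with one row per shot.
    """
    H = np.asarray(check_matrix, dtype=np.uint8)
    p = np.asarray(error_probs, dtype=np.float64)
    if H.ndim != 2 or p.shape != (H.shape[1],):
        raise ValueError("error_probs must have one entry per column of check_matrix")

    rng = np.random.default_rng(seed)
    # One Bernoulli draw per (shot, mechanism); p broadcasts across the shot axis.
    errors = (rng.random((num_shots, p.size)) < p).astype(np.uint8)
    # Accumulate in int64 so the row sums cannot wrap, then reduce to a parity bit.
    syndromes = (errors.astype(np.int64) @ H.T.astype(np.int64)) % 2
    return syndromes.astype(np.uint8), errors


if __name__ == "__main__":
    H = [[1, 1, 0],
         [0, 1, 1]]
    syndromes, errors = sample_dem(H, [0.1, 0.2, 0.1], num_shots=5, seed=1234)
    print(errors.shape, syndromes.shape)
```

Passing a seed makes the draw reproducible, which is convenient when comparing a CPU result against another backend on the same inputs.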
- Use cudaMallocAsync/cudaFreeAsync for all GPU temporaries to avoid
  implicit device synchronization that breaks multi-stream concurrency
  (critical for PyTorch CUDA stream integration)
- Replace synchronous cudaMemcpy with cudaMemcpyAsync on the caller's
  stream for the probability D->H copy
- Add grid dimension overflow guards before every CUDA kernel launch
- Handle numShots=0 gracefully in both C++ CPU path and Python binding
- Binarize check_matrix with & 1u in CPU path to match GPU kernel
  behavior and prevent uint8 dot-product overflow
- Clear sticky CUDA errors (cudaGetLastError) on all failure paths in
  the Python binding's GPU allocation/copy helpers
- Fix pre-existing test_non_default_cuda_stream assertion that compared
  torch.device("cuda") against torch.device("cuda", index=0)
- Add 12 new tests covering zero-shot edge case, non-binary H matrix
  CPU/GPU parity, and seedless code path (5 C++, 7 Python)
Signed-off-by: kvmto <kmato@nvidia.com>
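Two of the items above, binarizing the check matrix with `& 1` and handling zero shots, can be illustrated with a small NumPy sketch. The helper name `syndromes_from_errors` is hypothetical, and the PR's actual C++ code may be structured differently; the sketch only shows that masking each entry to its low bit keeps every contribution at 0 or 1, and that an empty batch yields an empty result with the expected shape.

```python
import numpy as np


def syndromes_from_errors(check_matrix, errors):
    """Hypothetical helper: parity of each check over each shot's errors.

    check_matrix: (num_checks, num_mechanisms), entries may be non-binary.
    errors:       (num_shots, num_mechanisms) array of 0/1 values, num_shots >= 0.
    """
    # Keep only the low bit of every entry, so 2 -> 0, 3 -> 1, 255 -> 1.
    H = np.asarray(check_matrix).astype(np.uint8) & np.uint8(1)
    E = np.asarray(errors, dtype=np.uint8)
    if E.ndim != 2 or E.shape[1] != H.shape[1]:
        raise ValueError("errors must have shape (num_shots, num_mechanisms)")
    # Widen before the matrix product: a uint8 accumulator wraps past 255.
    counts = E.astype(np.int64) @ H.T.astype(np.int64)
    return (counts & 1).astype(np.uint8)


if __name__ == "__main__":
    H = [[2, 3, 255]]                      # non-binary entries, binarized to [0, 1, 1]
    shots = [[1, 1, 1], [1, 1, 0]]
    print(syndromes_from_errors(H, shots))

    empty = np.zeros((0, 3), dtype=np.uint8)   # zero-shot edge case
    print(syndromes_from_errors(H, empty).shape)
```

With zero rows the matrix product already produces a `(0, num_checks)` array, so no special-case branch is needed in this sketch.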
Summary
- dem_sampling in C++ with both CPU and cuStabilizer-backed GPU paths.
- Python binding (cudaq_qec.dem_sampling) with backend selection (auto/cpu/gpu), plus PyTorch tensor support including CUDA device-pointer flow for GPU execution.

Build/Packaging updates
- FindcuStabilizer CMake module and QEC CMake integration.
- CUDAQ_QEC_REQUIRE_CUSTABILIZER enforcement path for builds that must ship GPU support.
- libs/qec/pyproject.toml.cu12 and libs/qec/pyproject.toml.cu13 for dem_sampling optional dependencies (including torch + cuquantum extras).

Test plan
- dem_sampling unit tests (DemSamplingCPU, DemSamplingGPU, and related QEC integration tests) in CI.
- libs/qec/python/tests/test_dem_sampling.py in CI with CUDA-enabled torch.