
Add dem_sampling CPU/GPU across C++ and Python#479

Open
kvmto wants to merge 17 commits into NVIDIA:main from kvmto:dem_sampling_pr

Conversation


kvmto commented Apr 1, 2026

Summary

  • Add dem_sampling in C++ with both CPU and cuStabilizer-backed GPU paths.
  • Expose the feature through pybind and Python (cudaq_qec.dem_sampling) with backend selection (auto / cpu / gpu), plus PyTorch tensor support, including a CUDA device-pointer path for GPU execution.
  • Add end-to-end coverage for C++ and Python paths (CPU/GPU, NumPy/PyTorch) and wire build/packaging so cuStabilizer is discovered and can be required for shipping builds.
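
The auto / cpu / gpu selection described above can be sketched as a small dispatch helper. This is an illustrative sketch only: `resolve_backend`, its signature, and the exact error behavior are assumptions, not the PR's actual API.

```python
# Hypothetical sketch of auto/cpu/gpu backend resolution; the function name
# and the error behavior are assumptions, not the PR's actual API.
def resolve_backend(requested: str, gpu_available: bool) -> str:
    """Map a user-requested backend string to the backend actually used."""
    if requested == "auto":
        # Prefer the cuStabilizer-backed GPU path when it is available.
        return "gpu" if gpu_available else "cpu"
    if requested not in ("cpu", "gpu"):
        raise ValueError(f"unknown backend: {requested!r}")
    if requested == "gpu" and not gpu_available:
        # An explicit GPU request should not silently fall back to CPU.
        raise RuntimeError("GPU backend requested but cuStabilizer is unavailable")
    return requested
```

One plausible design choice shown here: "auto" degrades gracefully, while an explicit "gpu" request fails loudly so shipping builds notice a missing cuStabilizer.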

Build/Packaging updates

  • Add a FindcuStabilizer CMake module and integrate it into the QEC CMake build.
  • Add CUDAQ_QEC_REQUIRE_CUSTABILIZER enforcement path for builds that must ship GPU support.
  • Update libs/qec/pyproject.toml.cu12 and libs/qec/pyproject.toml.cu13 for dem_sampling optional dependencies (including torch + cuquantum extras).

Test plan

  • C++: run dem_sampling unit tests (DemSamplingCPU, DemSamplingGPU, and related QEC integration tests) in CI.
  • Python: run libs/qec/python/tests/test_dem_sampling.py in CI with CUDA-enabled torch.

Introduce dem_sampling implementations for CPU and cuStabilizer-backed GPU paths in C++, and expose them through pybind/Python with torch tensor and device-pointer support. Add C++/Python coverage for backend paths and wire build/packaging checks so cuStabilizer requirements are enforced for shipping.

Signed-off-by: kvmto <kmato@nvidia.com>
kvmto requested review from bmhowe23, ivanbasov, and wsttiger on April 1, 2026 at 16:02
kvmto added 2 commits April 1, 2026 16:03
Review context (pyproject optional-dependencies excerpt):

"tensorrt-cu12"
]
dem_sampling = [
"torch",
A reviewer asked:

Is there a way to do DEM sampling without Torch?

kvmto (Author) replied:

Yes, with NumPy; that is the most likely usage of the CPU implementation.
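
A Torch-free CPU path can be sketched in pure NumPy. This is a hedged reference sketch of what DEM sampling computes, not the PR's implementation: `sample_dem`, `H`, and `probs` are illustrative names, and the DEM is assumed to be given as a binary detector-by-mechanism check matrix plus per-mechanism probabilities.

```python
# Pure-NumPy reference sketch of DEM sampling, independent of Torch.
# All names (sample_dem, H, probs) are illustrative, not the PR's actual API.
import numpy as np

def sample_dem(H, probs, num_shots, seed=None):
    """Sample detector outcomes from a detector error model.

    H      -- (num_detectors, num_mechanisms) check matrix (binarized mod 2)
    probs  -- (num_mechanisms,) per-mechanism trigger probabilities
    Returns a (num_shots, num_detectors) uint8 array of detector flips.
    """
    if num_shots == 0:  # mirror the PR's zero-shot handling: empty, well-shaped
        return np.empty((0, H.shape[0]), dtype=np.uint8)
    rng = np.random.default_rng(seed)
    # Bernoulli-sample which error mechanisms fire in each shot.
    errors = (rng.random((num_shots, probs.size)) < probs).astype(np.uint8)
    # Detector outcome = parity (mod 2) of the triggered mechanisms touching it.
    dot = errors.astype(np.int64) @ (H & 1).T.astype(np.int64)
    return (dot % 2).astype(np.uint8)

# Tiny example: two mechanisms, two detectors.
H = np.array([[1, 0], [1, 1]], dtype=np.uint8)
dets = sample_dem(H, np.array([0.1, 0.5]), num_shots=1000, seed=0)
```

With a fixed seed the sampler is reproducible, which is what makes CPU-vs-GPU parity tests like the ones in this PR practical to write.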

kvmto added 14 commits April 1, 2026 22:08
- Use cudaMallocAsync/cudaFreeAsync for all GPU temporaries to avoid
  implicit device synchronization that breaks multi-stream concurrency
  (critical for PyTorch CUDA stream integration)
- Replace synchronous cudaMemcpy with cudaMemcpyAsync on the caller's
  stream for the probability D->H copy
- Add grid dimension overflow guards before every CUDA kernel launch
- Handle numShots=0 gracefully in both C++ CPU path and Python binding
- Binarize check_matrix with & 1u in CPU path to match GPU kernel
  behavior and prevent uint8 dot-product overflow
- Clear sticky CUDA errors (cudaGetLastError) on all failure paths in
  the Python binding's GPU allocation/copy helpers
- Fix pre-existing test_non_default_cuda_stream assertion that compared
  torch.device("cuda") against torch.device("cuda", index=0)
- Add 12 new tests covering zero-shot edge case, non-binary H matrix
  CPU/GPU parity, and seedless code path (5 C++, 7 Python)
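
The mod-2 semantics behind the `& 1` binarization can be checked in a few lines of NumPy (the actual C++/CUDA code is not shown here, so this is a sketch of the invariant, not the implementation): reducing check-matrix entries mod 2 before the dot product leaves the parity result unchanged, which is what keeps the CPU path consistent with a bitwise GPU kernel even for non-binary H matrices.

```python
# Sketch of the invariant behind binarizing the check matrix with "& 1":
# (a mod 2) * e == a * e (mod 2) term by term, so reducing entries up front
# preserves the GF(2) dot product while keeping every term 0 or 1 (which also
# bounds accumulation in narrow integer types by the row weight).
import numpy as np

rng = np.random.default_rng(1)
# A deliberately non-binary check matrix, as the new parity tests exercise.
H = rng.integers(0, 7, size=(4, 10), dtype=np.uint8)
e = rng.integers(0, 2, size=(10, 256), dtype=np.uint8)  # 256 error vectors

# Binarize entries first, then take the dot product mod 2 ...
binarized = ((H & 1).astype(np.int64) @ e.astype(np.int64)) % 2
# ... which matches the full integer dot product reduced mod 2.
reference = (H.astype(np.int64) @ e.astype(np.int64)) % 2
assert np.array_equal(binarized, reference)
```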
