Migrates concurrent root LP to OpenMP#1291
Conversation
# Conflicts: # python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py
The cuDSS warmup and eager barrier raft::handle_t construction on the main thread were added to keep cuBLAS / cuSPARSE / cuSolverDn / cuDSS first-use synchronous init from invalidating PDLP's CUDA graph capture. manual_cuda_graph_t (from cuda_graph_side_capture) now recovers from cudaErrorStreamCaptureInvalidated by re-running the captured work eagerly, so the preflight is no longer needed. Move barrier handle construction back inside the barrier OMP task body, matching the pre-warmup pattern. Destruction stays at parent scope (post-taskgroup join) so cublasDestroy → cudaDeviceSynchronize doesn't fire during a capture.
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
📝 WalkthroughWalkthroughThis PR refactors the concurrent LP solver to use OpenMP taskgroups instead of std::thread spawning. A new compile-time constant gates barrier execution when running under MIP with insufficient thread availability. The barrier and dual simplex now dispatch as OpenMP tasks, with the barrier constructing its own CUDA context inside the task and results conditionally processed based on execution eligibility. ChangesConcurrent LP solver refactoring
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/pdlp/solve.cu (1)
40-61:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd explicit
<atomic>include forstd::atomicusage incpp/src/pdlp/solve.cu
std::atomic<int> global_concurrent_halt{0};is used at line 331, but this TU has no#include <atomic>, so it relies on transitive includes.Suggested fix
`#include` <algorithm> +#include <atomic> `#include` <cmath> `#include` <exception> `#include` <set>🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 40 - 61, This TU uses std::atomic<int> via the symbol global_concurrent_halt but does not include <atomic>, relying on transitive includes; add an explicit `#include` <atomic> near the other headers in solve.cu so the declaration std::atomic<int> global_concurrent_halt{0}; and any other atomic usages are correctly defined and portable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/mip_heuristics/mip_constants.hpp`:
- Around line 25-27: The compile-time gate
CUOPT_CONCURRENT_LP_BARRIER_REQUIRED_THREAD_COUNT is too high for the new task
topology in run_concurrent() (where dual simplex is no longer launched); change
the constant definition in mip_constants.hpp so it requires 2 threads (PDLP +
barrier) instead of 3, and add a short comment referencing run_concurrent() /
cpp/src/pdlp/solve.cu to explain the rationale.
---
Outside diff comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 40-61: This TU uses std::atomic<int> via the symbol
global_concurrent_halt but does not include <atomic>, relying on transitive
includes; add an explicit `#include` <atomic> near the other headers in solve.cu
so the declaration std::atomic<int> global_concurrent_halt{0}; and any other
atomic usages are correctly defined and portable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f8b0fe07-8fd9-4fc7-8c44-b2d9bf6a263f
📒 Files selected for processing (2)
cpp/src/mip_heuristics/mip_constants.hppcpp/src/pdlp/solve.cu
| // MIP-only gate: skip the concurrent barrier when fewer threads are available than this | ||
| // (1 PDLP + 1 dual simplex + 1 barrier). Stand-alone LP always runs all three. | ||
| #define CUOPT_CONCURRENT_LP_BARRIER_REQUIRED_THREAD_COUNT 3 |
There was a problem hiding this comment.
Lower this MIP barrier gate to match the new task topology.
In MIP mode run_concurrent() no longer launches dual simplex (cpp/src/pdlp/solve.cu, Lines 1615-1626), so this path only needs PDLP + barrier. Keeping the gate at 3 disables the barrier unnecessarily whenever exactly two threads are available.
Suggested fix
-// MIP-only gate: skip the concurrent barrier when fewer threads are available than this
-// (1 PDLP + 1 dual simplex + 1 barrier). Stand-alone LP always runs all three.
-#define CUOPT_CONCURRENT_LP_BARRIER_REQUIRED_THREAD_COUNT 3
+// MIP-only gate: skip the concurrent barrier when fewer threads are available than this
+// (1 PDLP + 1 barrier). Stand-alone LP always runs all three.
+#define CUOPT_CONCURRENT_LP_BARRIER_REQUIRED_THREAD_COUNT 2
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/mip_heuristics/mip_constants.hpp` around lines 25 - 27, The
compile-time gate CUOPT_CONCURRENT_LP_BARRIER_REQUIRED_THREAD_COUNT is too high
for the new task topology in run_concurrent() (where dual simplex is no longer
launched); change the constant definition in mip_constants.hpp so it requires 2
threads (PDLP + barrier) instead of 3, and add a short comment referencing
run_concurrent() / cpp/src/pdlp/solve.cu to explain the rationale.
There was a problem hiding this comment.
The dual simplex is launched in B&B, not here. So the thread count is correct
There was a problem hiding this comment.
@nguidotti, thanks for the clarification! That makes sense — since B&B dispatches the dual simplex as its own OpenMP task that still occupies a slot in the shared thread pool, the full budget of 3 (PDLP + barrier + dual simplex from B&B) is indeed required. My analysis was wrong to treat the B&B-launched dual simplex as outside the thread count. The gate and the comment are correct as written.
🧠 Learnings used
Learnt from: aliceb-nv
Repo: NVIDIA/cuopt PR: 798
File: cpp/src/utilities/timing_utils.hpp:25-29
Timestamp: 2026-01-26T14:18:09.111Z
Learning: In NVIDIA/cuopt, headers under cpp/src should not rely on MSVC-specific features. When reviewing cpp/src/**/*.hpp, verify there is no MSVC-only code paths (e.g., `#ifdef` _MSC_VER, MSVC inline assembly syntax) and prefer portable GCC/Clang-compatible implementations for Linux. If platform-specific code is needed, guard it with portable, clearly documented checks and ensure it compiles with GCC/Clang on Linux. Maintain portability by avoiding MSVC-only fallbacks unless explicitly required by a supported build configuration.
Learnt from: aliceb-nv
Repo: NVIDIA/cuopt PR: 986
File: cpp/src/branch_and_bound/branch_and_bound.cpp:8-8
Timestamp: 2026-03-23T11:33:23.998Z
Learning: In this repo (NVIDIA/cuopt), treat nvcc as the supported CUDA toolchain; clang-based compilation/support is not required and may fail/break. During code reviews, do NOT request code changes or add blocking comments for errors that appear only under clang (e.g., header-resolution failures such as 'utilities/determinism_log.hpp not found')—these can be toolchain-related rather than real source issues.
Learnt from: bdice
Repo: NVIDIA/cuopt PR: 1035
File: cpp/tests/utilities/base_fixture.hpp:29-47
Timestamp: 2026-04-19T16:49:17.616Z
Learning: In NVIDIA/cuopt (and the upstream rapidsai/rmm) after the RMM → CCCL migration, the rmm::mr resource adaptors (e.g., rmm::mr::pool_memory_resource and rmm::mr::binning_memory_resource) are now owning: they take/hold their upstream resources by value instead of non-owning references. Therefore, direct construction of adaptor chains from temporaries/local values (e.g., pool_memory_resource(make_async(), size)) is safe and should NOT be flagged as potential dangling/lifetime bugs. Also, rmm::mr::make_owning_wrapper is no longer needed/available for this owning design, so do not suggest it in this codepath.
|
/merge |
This PR cleans #1252, so only the changes related to the OpenMP are present.
Original description:
run_concurrentincpp/src/pdlp/solve.cunow dispatches the barrier and dual simplex workers as#pragma omp taskinside a#pragma omp taskgroupinstead of rawstd::thread. PDLP still runs synchronously on the dispatching thread.MIP path (
omp_in_parallel()): reuses the upstreamsolve_mipOMP team. Barrier and dual simplex now consume slots from the configurednum_cpu_threadsbudget instead of spawning extra OS threads outside it.Stand-alone LP path: stands up a local
#pragma omp parallel + singlewith the right number of workers.This PR also removes the confuscated
std::futurelogic on the barrier on other PR.Checklist