Fix Windows / MSVC build & runtime: ViPE now runs end-to-end on Windows 10/11 #84
Open
JVRHOLDINGS wants to merge 1 commit into
…ws 10/11

Six patches that together make ViPE buildable + runnable on Windows without WSL/Docker. Verified end-to-end on Windows 10 build 26100 with VS 2022 Pro + CUDA 12.4 + torch 2.5.1+cu124 + Python 3.10.

Root-cause patch: csrc/lietorch_ext/lietorch_cpu.cpp wraps the 12 host kernel templates in an anonymous namespace. Without this the Windows MSVC linker aliases the CPU host templates with the same-name __global__ CUDA templates in lietorch_gpu.cu, then routes every GPU kernel launch into the host C++ implementation. The host template then dereferences CUDA device pointers and the process dies with STATUS_ACCESS_VIOLATION (0xC0000005) at first BA iteration. On Linux GCC mangles them distinctly so this works "by accident".

Supporting patches:
- vipe/utils/io.py: Windows-safe NamedTemporaryFile for OpenEXR (Windows holds the file exclusively while Python handle is open; OpenEXR's second open fails with "Permission denied")
- csrc/slam_ext/geom_kernels.cu + csrc/utils_ext/knn.cu: replace <long> template parameters with <int64_t>. MSVC long is 32-bit but the tensors are torch::kInt64; the original links cleanly on Linux (where long == int64_t) but fails LNK2001 on Windows.
- vipe/ext/specs.py: emit -I<path> and /O2 on Windows (cl.exe + nvcc both accept). The original -isystem flag is silently dropped by MSVC and -O3 is unknown to cl.exe. Without this Eigen 3.4.0 auto-download never reaches the compiler.
- setup.py: opt-in PDB emission via VIPE_DEBUG_SYMBOLS=1 (off by default). Enables crash-dump analysis with cdb/WinDbg/Visual Studio.

All patches are gated on platform.system() == "Windows" or are semantic no-ops on Linux/macOS. See WINDOWS_SUPPORT_PR.md for the full diagnostic methodology + each patch's symptom -> root cause -> fix walkthrough.
Pull request overview
This PR adds a set of targeted portability fixes so ViPE can build and run end-to-end on Windows (MSVC + CUDA), while keeping behavior unchanged on Linux/macOS.
Changes:
- Prevent MSVC linker symbol collisions between host and `__global__` CUDA templates by giving CPU kernel templates internal linkage.
- Fix Windows runtime artifact writing by avoiding `NamedTemporaryFile` locking issues when OpenEXR re-opens the temp file.
- Improve Windows build compatibility by using MSVC-compatible optimization/include flags, using `int64_t` instead of `long` for kInt64 tensors, and optionally emitting PDBs via `VIPE_DEBUG_SYMBOLS=1`.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| WINDOWS_SUPPORT_PR.md | Adds detailed Windows build/runtime rationale and a working install recipe. |
| vipe/utils/io.py | Makes EXR depth artifact writing Windows-safe by using a close-then-open temp file workflow. |
| vipe/ext/specs.py | Adjusts include/optimization flags for MSVC and adds opt-in debug symbol compile flags. |
| setup.py | Adds opt-in Windows linker args to generate a consolidated PDB when requested. |
| csrc/utils_ext/knn.cu | Uses int64_t instead of long for portability with tensor sizes on Windows. |
| csrc/slam_ext/geom_kernels.cu | Replaces long with int64_t in accessors/data_ptr usage to match torch::kInt64 on Windows. |
| csrc/lietorch_ext/lietorch_cpu.cpp | Wraps CPU template kernels in an anonymous namespace to avoid MSVC symbol collisions with CUDA kernels. |
Comment on lines 20 to 25

    PACKAGE_NAME = "vipe"
    SOURCE_CONFIG_DIR = Path(__file__).resolve().parent / "configs"

    import platform

    coder_finder_path = f"{PACKAGE_NAME}/ext/specs.py"
Comment on lines +51 to +64

    Both MSVC ``cl.exe`` (modern versions, ≥ VS2017) and nvcc accept the
    GCC-style ``-I<path>`` form (no space). So we emit that everywhere
    instead of branching on OS. We keep the helper for symmetry with the
    earlier ``-isystem`` form (which we drop on Windows because cl.exe
    treats it as an unknown option and silently demotes the next arg to
    "unrecognised source file").
    """
    if _IS_WINDOWS:
        # `-I<path>` with no space, accepted by both cl.exe and nvcc on Win.
        return [f"-I{path}"]
    # Linux / macOS: keep `-isystem` so Eigen template noise doesn't pollute
    # the build log. Both GCC and Clang understand this on POSIX hosts; nvcc
    # forwards it to the host compiler unchanged.
    return ["-isystem", path]
    // this file) with the SAME-named __global__ CUDA template in
    // lietorch_gpu.cu. Result on Windows: the GPU dispatch function
    // `inv_forward_gpu` calls into the CPU template instead of launching the
    // CUDA kernel, dereferencing CUDA device pointers on the host →

Comment on lines +955 to +956

    // data_ptr<int64_t> / <int32_t> explicit instantiations; `<long>` lives
    // only in the GCC build because there `long ≡ int64_t`.

    const long *ii_data = ii_cpu.data_ptr<long>();
    const long *jj_data = jj_cpu.data_ptr<long>();
    const long *kk_data = kk_cpu.data_ptr<long>();
    // Same `long` → int64_t portability fix as accum_cuda above.
Fix Windows / MSVC build & runtime: ViPE now runs end-to-end on Windows 10/11
TL;DR
Six small patches that together make ViPE buildable + runnable on Windows
without WSL or a Docker container. Verified end-to-end on Windows 10
(build 26100) with VS 2022 Professional + CUDA Toolkit 12.4 + torch
2.5.1+cu124. The pipeline now produces correct `pose/`, `intrinsics/`, `depth/`, `mask/`, `rgb/` artifacts from both the bundled `assets/examples/dog-example.mp4` and arbitrary user clips.

Patches break down:

- `csrc/lietorch_ext/lietorch_cpu.cpp`: wrap host kernel templates in an anonymous namespace. This is the critical fix. Without it, the Windows MSVC linker silently aliases CPU host templates with same-name `__global__` CUDA templates in `lietorch_gpu.cu`, then routes GPU kernel launches into the host C++ implementation. The host template dereferences CUDA device pointers and the process dies with `STATUS_ACCESS_VIOLATION` (0xC0000005) at first BA iteration.
- `vipe/utils/io.py`: Windows-safe `tempfile.NamedTemporaryFile` handling for OpenEXR. Windows holds an exclusive lock on the temp file while the Python handle is open; OpenEXR's second open fails with "Permission denied" and the pipeline aborts AFTER SLAM has already succeeded.
- `csrc/slam_ext/geom_kernels.cu`, `csrc/utils_ext/knn.cu`: replace `<long>` template parameters with `<int64_t>`. MSVC's `long` is 32-bit but the underlying tensors are `torch::kInt64` (8 bytes); the original code links cleanly on Linux (where `long == int64_t`) but fails to link on Windows (LNK2001: unresolved external `at::TensorBase::data_ptr<long>`).
- `vipe/ext/specs.py`: Windows-aware compiler flag handling. MSVC's `cl.exe` doesn't accept `-O3` or `-isystem`. Auto-download of Eigen 3.4.0 still works, just emit `-I<path>` (which both `cl` and `nvcc` accept) and `/O2` (MSVC's max-speed flag) instead of the GCC equivalents.
- `setup.py`: opt-in PDB emission via `VIPE_DEBUG_SYMBOLS=1` env var. Off by default to keep release builds lean; on for developers who want crash dumps to resolve to source line numbers.
- … recipe that worked on Windows 10/11.
No effect on Linux or macOS — every patch is either Windows-conditional
or a portable refactor that doesn't change behaviour on POSIX.
Patch 1 (root cause): `csrc/lietorch_ext/lietorch_cpu.cpp`

Symptom

On Windows, `vipe infer --image-dir <frames> -o <out>` consistently exits with code `-1073741819` (`0xC0000005`, `STATUS_ACCESS_VIOLATION`) immediately after SLAM Pass 1 frontend completes, before Pass 2 starts. No Python traceback. No CUDA runtime error. The process is killed synchronously by the Windows kernel.

Reproducible with:

- `assets/examples/dog-example.mp4` (bundled reference clip): crashes at frame 24/122 in SLAM Pass 1
- … crashes in `backend.run(7)` before Pass 2

Reproduced on every Windows configuration tested, independent of GPU, CUDA toolkit version (tested 12.4), torch CUDA build (tested cu121 and cu124), or VRAM budget (tested 8 GB and 16 GB GPUs).
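The signed decimal exit code and the hex status quoted in the symptom are the same 32-bit value. A minimal sketch of the conversion follows; the helper name `exit_code_to_ntstatus` is hypothetical and not part of ViPE.

```python
def exit_code_to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS value."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit representation.
    return exit_code & 0xFFFFFFFF


status = exit_code_to_ntstatus(-1073741819)
print(f"0x{status:08X}")  # prints 0xC0000005
```

This is handy when a wrapper script only sees the decimal return code and the crash needs to be matched against Windows status-code listings.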
Root cause

`csrc/lietorch_ext/lietorch_cpu.cpp` and `csrc/lietorch_ext/lietorch_gpu.cu` both define 12 templates with identical signatures:

| Template | `lietorch_cpu.cpp` (host) | `lietorch_gpu.cu` (device) |
|---|---|---|
| `exp_forward_kernel<G, T>` | `at::parallel_for` | `__global__` |
| `exp_backward_kernel<G, T>` | | |
| `log_forward_kernel<G, T>` | | |
| `log_backward_kernel<G, T>` | | |
| `inv_forward_kernel<G, T>` | | |
| `inv_backward_kernel<G, T>` | | |
| `mul_forward_kernel<G, T>` | | |
| `mul_backward_kernel<G, T>` | | |
| `adj_forward_kernel<G, T>` | | |
| `adjT_forward_kernel<G, T>` | | |
| `act_forward_kernel<G, T>` | | |
| `act4_forward_kernel<G, T>` | | |

Both translation units use the standard pybind11 emission with default external linkage for templates. The host versions are plain C++ functions; the device versions are decorated with `__global__`.

On Linux + GCC:

- `__global__` attribute participates in name mangling, producing a distinct mangled symbol for each pair
- `_Z18inv_forward_kernelIN7lietorch4SE3IfEEfEvPKT0_PS3_i` (host) vs `_Z18inv_forward_kernel...` with a host-side stub for the launch registration

On Windows + MSVC:

- … `__global__` from host versions: both produce the same mangled symbol (e.g. `??$inv_forward_kernel@VSE3@1@M@@YAXPEBMPEAMH@Z`)
- … `lietorch_cpu.obj` and `lietorch_gpu.obj` and silently picks ONE (typically the earlier one in link order = the CPU host version)
- `__global__` GPU kernel is rewritten by the linker to call the host C++ version instead
- … `<<<NUM_BLOCKS, NUM_THREADS>>>` syntax) gets compiled to `cudaLaunchKernel(&hostFunction, ...)`, but `hostFunction` is now the C++ host loop, not the device kernel

When `inv_forward_gpu(group_id, X)` is called with X on `cuda:0`:

`X.data_ptr<float>()` returns a CUDA device pointer like `0x0000000c_a51f9a00`. On Linux this gets passed to the GPU kernel which runs on the GPU and reads device memory fine. On Windows, the call is routed to the host template in `lietorch_cpu.cpp:84`:

`Group(ptr)` invokes `Eigen::Matrix<float, 4, 1>::Matrix(const float*)` which copies 4 floats from `ptr`. The host CPU tries to read from a CUDA device address, and the Windows kernel raises `STATUS_ACCESS_VIOLATION` (0xC0000005) at the `movups xmm2, xmmword ptr [rcx+rbx+0Ch]` instruction.
Fix

Wrap all 12 host kernel templates in `lietorch_cpu.cpp` in an anonymous namespace. Templates in anonymous namespaces get internal linkage (guaranteed since C++11, so it holds under C++17), meaning their symbols are not exported from the translation unit and the linker cannot confuse them with same-name symbols from other translation units.

Total diff: 2 lines added (`namespace {` and `}`), no logic changes.

Verification
After patch + clean rebuild, on the exact same 8-frame test:
Artifacts:
- `out/pose/<name>.npz`: (N, 4, 4) cam-to-world SE3 matrices
- `out/intrinsics/<name>.npz`: (N, 4) per-frame fx/fy/cx/cy
- `out/depth/<name>.zip`: N OpenEXR depth maps
- `out/mask/<name>.zip`: N TrackAnything dynamic-content masks
- `out/rgb/<name>.mp4`: re-encoded RGB video
- `out/vipe/<name>_info.pkl`: BA residual + meta info

Diagnostic methodology
The root cause was identified by:
- … `python.exe` (registry-only, no admin) → automatic `.dmp` capture on every crash
- … `/Zi` host compile flag + `/DEBUG` linker flag to emit a consolidated PDB next to `vipe_ext.pyd`
- `cdb -z <dump.dmp> -cf <script>` with `!analyze -v` and `k 30` to extract the native stack trace

Without PDB symbols, the crash was attributed to "somewhere inside vipe_ext, near `PyInit_vipe_ext+0x26828`", which is unactionable. With PDB symbols, `!analyze -v` resolved the exact source line:

Combined with the deeper stack frame `vipe_ext!inv_forward_gpu::__l2::...` showing the call came from the GPU dispatcher, the host-template-being-called-instead-of-GPU-kernel pattern became unambiguous.
Patch 2: `vipe/utils/io.py` (Windows-safe OpenEXR tempfile)

Symptom

After Patch 1 lands, SLAM completes successfully and reaches `save_artifacts(...)`. Then:

Root cause

`tempfile.NamedTemporaryFile` opens the file with `O_TEMPORARY` / `O_EXCL` semantics on Windows. The OS holds an exclusive lock on the underlying file while the Python handle is active. When `OpenEXR.OutputFile(f.name, ...)` tries to open the same file for writing from native code, Windows refuses with ERROR_SHARING_VIOLATION which surfaces as "Permission denied".

On Linux this works because POSIX allows concurrent opens by default.

Fix

Create the temp file, close the Python handle immediately, hand the name to OpenEXR, then manually unlink in a `finally` block:

Same behaviour on Linux; works on Windows.
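The close-then-reopen workflow described in this fix can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `vipe/utils/io.py`: the helper `write_via_closed_tempfile` and the stand-in `fake_exr_writer` are hypothetical, and the stand-in simply writes bytes where the real code would hand the path to OpenEXR.

```python
import os
import tempfile
from typing import Callable


def write_via_closed_tempfile(write_fn: Callable[[str], None], suffix: str = ".exr") -> bytes:
    """Let an external writer fill a temp file, then return its bytes."""
    # delete=False keeps the file on disk after close(), so a second opener
    # (native code in the real pipeline) is not blocked by our own handle.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    tmp.close()
    try:
        write_fn(tmp.name)  # the writer opens the path itself
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        os.unlink(tmp.name)  # manual cleanup replaces delete-on-close


def fake_exr_writer(path: str) -> None:
    # Hypothetical stand-in for the OpenEXR call; writes placeholder bytes.
    with open(path, "wb") as f:
        f.write(b"depth-bytes")


payload = write_via_closed_tempfile(fake_exr_writer)
```

The `finally` block matters because `delete=False` turns off automatic cleanup; without it a failing writer would leave stray temp files behind.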
Patch 3: `csrc/slam_ext/geom_kernels.cu`, `csrc/utils_ext/knn.cu` (MSVC `long` width)

Symptom

At link time on Windows:

Root cause

`geom_kernels.cu` uses `tensor.data_ptr<long>()` and `PackedTensorAccessor32<long, ...>` throughout. The tensors involved (`ii`, `jj`, `kk`, `idx`, etc.) are explicitly constructed with `torch::TensorOptions().dtype(torch::kInt64)` (8 bytes per element).

On Linux + GCC, `sizeof(long) == 8` so `<long>` resolves to the explicit `<int64_t>` template instantiation that PyTorch ships in its prebuilt libraries.

On Windows + MSVC, `sizeof(long) == 4`. PyTorch only ships `data_ptr<int32_t>` and `data_ptr<int64_t>` instantiations explicitly; `data_ptr<long>` on Windows would need a `data_ptr<int32_t>` instantiation that mis-strides the underlying kInt64 storage. The linker rejects with LNK2001.
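The width difference behind this link error can be observed from Python without compiling anything, since `ctypes.c_long` mirrors the platform C `long`. A small sketch (the helper name `long_matches_int64` is hypothetical):

```python
import ctypes
import platform

# c_long follows the platform C ABI: 4 bytes on 64-bit Windows (LLP64),
# 8 bytes on 64-bit Linux and macOS (LP64). c_int64 is always 8 bytes.
long_bytes = ctypes.sizeof(ctypes.c_long)
int64_bytes = ctypes.sizeof(ctypes.c_int64)


def long_matches_int64() -> bool:
    """True when C `long` has the same width as int64_t on this platform."""
    return long_bytes == int64_bytes


print(platform.system(), "long:", long_bytes, "int64_t:", int64_bytes,
      "interchangeable:", long_matches_int64())
```

On a platform where this reports `False`, code that treats `long` and `int64_t` as the same type is relying on an assumption that does not hold there.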
Fix

Replace `<long>` with `<int64_t>` everywhere in the affected `.cu` files. The semantic meaning is the same on Linux (since `long == int64_t`); on Windows the new code uses the correct 8-byte stride.

Files touched:

- `csrc/slam_ext/geom_kernels.cu`: all `<long>` → `<int64_t>` for template parameters, plus a few `int64_t` local variables to match
- `csrc/utils_ext/knn.cu`: 2 local `long` declarations changed to `int64_t` for portability against `Tensor.size()` return type

Patch 4: `vipe/ext/specs.py` (Windows-aware compile flags)

Symptom
Build then fails with `fatal error C1083: Cannot open include file: 'eigen3/Eigen/Dense': No such file or directory`.

Root cause

`_eigen_include_flags()` and the optimization flag in `get_cpp_flags()` emit GCC/Clang syntax:

- `-isystem <path>`: works for GCC/Clang, silently ignored by MSVC, AND MSVC then treats the next argument (the path) as a source file, AND rejects it as "unrecognized source file type"
- `-O3`: works for GCC/Clang, MSVC has no `-O3`; MSVC's max-speed flag is `/O2`

The result is that the Eigen header search path never reaches the compiler, and Eigen 3.4.0 (which `specs.py` auto-downloads to `csrc/include/eigen3/Eigen/`) becomes invisible.

For `nvcc` compilation of `.cu` files, the host-compiler flags are forwarded via `-Xcompiler`; the same problem applies, plus nvcc itself treats `/I<path>` (MSVC syntax) as an extra positional argument and aborts:

Fix

Detect Windows at module import time:

Emit `-I<path>` (no space) on Windows, which is accepted by both MSVC's `cl.exe` and `nvcc`, and `-isystem <path>` on Linux/macOS (keeps Eigen template noise out of warning logs):

Swap `-O3` for `/O2` on Windows host compiles. Keep nvcc's `-O3` unchanged (nvcc accepts it on all platforms):
Patch 5: `setup.py` (opt-in PDB emission)

Adds an `extra_link_args=["/DEBUG", "/OPT:REF", "/OPT:ICF"]` clause when `platform.system() == "Windows"` and the `VIPE_DEBUG_SYMBOLS=1` environment variable is set. Combined with the corresponding `/Zi` host compile flag in `specs.py`, this produces a consolidated `vipe_ext.cp310-win_amd64.pdb` (~80 MB) next to the `.pyd` that WinDbg / cdb / Visual Studio can use to resolve crash dumps to source file + line.

Off by default: only developers who want crash-investigation support need to set the env var.
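The opt-in gate described above can be sketched as a small helper. The function name `debug_link_args` and its injectable `system` / `env` parameters are hypothetical (they exist here so the logic can be checked off-Windows); only the flag list, the `platform.system()` check, and the `VIPE_DEBUG_SYMBOLS=1` variable come from the patch description.

```python
import os
import platform
from typing import List, Mapping, Optional


def debug_link_args(system: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return MSVC linker flags for PDB emission, or [] when not opted in."""
    system = platform.system() if system is None else system
    env = os.environ if env is None else env
    # Both conditions must hold: a Windows host and the explicit opt-in.
    if system == "Windows" and env.get("VIPE_DEBUG_SYMBOLS") == "1":
        return ["/DEBUG", "/OPT:REF", "/OPT:ICF"]
    return []


print(debug_link_args("Windows", {"VIPE_DEBUG_SYMBOLS": "1"}))
print(debug_link_args("Linux", {"VIPE_DEBUG_SYMBOLS": "1"}))  # []
```

Returning an empty list in the default case means the result can be passed straight through as `extra_link_args` without a separate conditional at the call site.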
Installation recipe (Windows 10/11)

Required tools:

- … edition with the C++ workload installed)
- … against)
- … `nvidia-vipe` wheel target on PyPI today is `cp310`)
- … app's main env; ViPE installs `torch` + 60 other deps including `OpenEXR` which can clash)

Smoke test (bundled sample):

Expect "Finished" + non-empty `vipe_results/pose/`, `intrinsics/`, `depth/`, `mask/`, `rgb/` directories.

Notes for reviewers
- … `platform.system() == "Windows"` or are semantic no-ops on Linux/macOS. Linux behaviour is unchanged.
- … `lietorch_cpu.cpp` is a portable C++17 idiom and arguably improves Linux too (cleaner intent: these templates are translation-unit private).
- … a `.github/workflows/windows.yml` if the maintainers want one.
- … `VIPE_DEBUG_SYMBOLS=1` to keep release builds slim; a maintainer might want to enable it for the nightly build to make user crash reports actionable.
Tested on:
- … `dog-example.mp4` + bundled `cosmos-example.mp4`
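The smoke-test expectation from the installation recipe (non-empty `pose/`, `intrinsics/`, `depth/`, `mask/`, `rgb/` directories under the output folder) could be checked with a short script like the one below. The helper name `missing_or_empty` is hypothetical, and the script only checks that each directory exists and contains at least one entry; it does not validate the artifacts themselves.

```python
from pathlib import Path
from typing import List, Union

# Directory names taken from the smoke-test description.
EXPECTED_DIRS = ("pose", "intrinsics", "depth", "mask", "rgb")


def missing_or_empty(root: Union[str, Path]) -> List[str]:
    """List the expected output directories that are absent or contain no entries."""
    root = Path(root)
    problems = []
    for name in EXPECTED_DIRS:
        directory = root / name
        if not directory.is_dir() or not any(directory.iterdir()):
            problems.append(name)
    return problems


if __name__ == "__main__":
    bad = missing_or_empty("vipe_results")
    print("all outputs present" if not bad else f"missing or empty: {bad}")
```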