[Bug] test11_reinterpret in test_arithmetic.py Failing in Local Build #322

Open
Microno95 opened this issue Dec 22, 2024 · 3 comments

@Microno95
Contributor

I've been hitting Fatal Python error: Aborted in the latest master of Mitsuba 3. To narrow down where things go wrong, I decided to check that Dr.Jit itself compiles and runs correctly by running the Dr.Jit test suite. Unfortunately, the suite never completes: every time it reaches test11_reinterpret in test_arithmetic.py, it crashes with Fatal Python error: Aborted. Setting dr.set_flag(dr.JitFlag.Debug, True) and compiling the v1.0.1 release instead made no difference. Neither did switching between the LLVM 18.1.8 bundled with MSVC, the LLVM 18.1.8 prebuilt binary from the official LLVM GitHub repo, and the LLVM 16.0.4 prebuilt binary from the same repo.

From running the test suite in Debug and RelWithDebInfo modes, I have identified that the crash occurs on this statement:

    with pytest.raises(RuntimeError) as e:
        dr.reinterpret_array(U64, t(1.0))

From what I can tell, when the stack unwinds in jitc_raise (https://github.com/mitsuba-renderer/drjit-core/blob/71a9f4865b88531d8a490eac03a3659040529df1/src/op.cpp#L1488-L1489), the destructor of one of the variables is called, which in turn triggers a call to jit_var_dec_ref_impl. That call tries to lock the mutex guarding the JIT variable state, and the runtime then fails.
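The suspected ordering problem can be modeled in a few lines of plain Python. This is only a toy sketch of the hypothesis above, not Dr.Jit's actual code: state_lock, Handle, and failing_op are hypothetical names, and a non-blocking threading.Lock stands in for the std::mutex seen in the stack trace (relocking a std::mutex from the thread that already owns it is undefined behavior in C++, so the real failure may look different).

```python
import threading

# Hypothetical stand-in for the global mutex guarding JIT variable state.
state_lock = threading.Lock()


class Handle:
    """Toy analogue of an array handle whose cleanup needs the state lock."""

    def __init__(self, log):
        self.log = log

    def release(self):
        # Non-blocking acquire: report the conflict instead of deadlocking.
        if state_lock.acquire(blocking=False):
            try:
                self.log.append("released cleanly")
            finally:
                state_lock.release()
        else:
            self.log.append("lock still held during cleanup")


def failing_op(log, cleanup_inside_lock):
    handle = Handle(log)
    try:
        with state_lock:
            try:
                raise RuntimeError("incompatible sizes")
            finally:
                if cleanup_inside_lock:
                    handle.release()  # runs while the lock is still held
    finally:
        if not cleanup_inside_lock:
            handle.release()  # runs after the `with` block dropped the lock


def run(cleanup_inside_lock):
    log = []
    try:
        failing_op(log, cleanup_inside_lock)
    except RuntimeError:
        pass
    return log
```

Calling run(True) models cleanup that happens before the lock is dropped, and run(False) models the ordering the code presumably intends, where the lock is released first. If the real crash follows the first pattern, the open question is why the unwinding order differs between builds.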

I've run the same tests against the pip-installed drjit library and don't encounter this issue at all, so perhaps it is my setup. Still, I'd like to resolve it to avoid issues downstream when compiling my own version of Mitsuba with the latest master.
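When comparing the pip-installed package against a local build, it helps to confirm which copy Python actually imports. The helper below is a small sketch: describe_build is a hypothetical name, and it assumes only that the module may expose a config submodule with the CXX_COMPILER and PYTHON_VERSION attributes shown in the help output further down.

```python
import importlib
import importlib.util


def describe_build(module_name="drjit"):
    """Return where a module is loaded from plus its build metadata, if any."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None  # module is not importable in this environment
    info = {"origin": spec.origin}
    try:
        config = importlib.import_module(module_name + ".config")
    except ImportError:
        return info  # no config submodule: report the location only
    for key in ("CXX_COMPILER", "PYTHON_VERSION"):
        info[key] = getattr(config, key, None)
    return info
```

Running print(describe_build()) once in each environment shows whether the interpreter picks up the wheel from site-packages or the locally built tree, and which compiler produced it.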

Stack at error:

<unknown> 0x00007ffb52722fb6
lock_acquire(std::mutex &) common.h:60
lock_guard::lock_guard(std::mutex &) common.h:67
jit_var_dec_ref_impl(unsigned int) api.cpp:594
drjit::JitArray::~JitArray<…>() jit.h:72
nanobind::detail::wrap_destruct<…>(void *) nb_class.h:241
nanobind::detail::inst_dealloc(_object *) nb_type.cpp:241

My system uses the latest Microsoft Visual Studio 2022 (MSVC 19.42.34435.0).

Help on module drjit.config in drjit:

NAME
    drjit.config - # DO NOT MODIFY: This file is automatically generated by the CMake configuration

DATA
    CXX_COMPILER = 'MSVC 19.42.34435.0'
    PYTHON_VERSION = '3.12.8'

FILE
    c:\users\ekin4\clionprojects\drjit\cmake-build-release\drjit\config.py

# packages in environment at C:\Users\ekin4\.conda\envs\drjit:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl  
bzip2                     1.0.8                h2bbff1b_6  
ca-certificates           2024.11.26           haa95532_0  
colorama                  0.4.6                    pypi_0    pypi
expat                     2.6.4                h8ddb27b_0  
icc_rt                    2022.1.0             h6049295_2  
iniconfig                 2.0.0                    pypi_0    pypi
intel-openmp              2023.1.0         h59b6b97_46320  
libffi                    3.4.4                hd77b12b_1  
mkl                       2023.1.0         h6b88ed4_46358  
mkl-service               2.4.0           py312h2bbff1b_1  
mkl_fft                   1.3.11          py312h827c3e9_0  
mkl_random                1.2.8           py312h0158946_0  
numpy                     2.1.3           py312hfd52020_0  
numpy-base                2.1.3           py312h4dde369_0  
openssl                   3.0.15               h827c3e9_0  
packaging                 24.2                     pypi_0    pypi
pip                       24.2            py312haa95532_0  
pluggy                    1.5.0                    pypi_0    pypi
pytest                    8.3.4                    pypi_0    pypi
python                    3.12.8               h14ffc60_0  
scipy                     1.14.1          py312h9d85e7c_0  
setuptools                72.1.0          py312haa95532_0
sqlite                    3.45.3               h2bbff1b_0
tbb                       2021.8.0             h59b6b97_0
tk                        8.6.14               h0416ee5_0
tzdata                    2024b                h04d1e81_0
vc                        14.40                haa95532_2
vs2015_runtime            14.42.34433          h9531ae6_2
wheel                     0.44.0          py312haa95532_0
xz                        5.4.6                h8cc25b3_1
zlib                      1.2.13               h8cc25b3_1
@Microno95
Contributor Author

Microno95 commented Dec 25, 2024

After testing many different configurations and compilers, the combination that ultimately worked was to generate the CMake project files with -G "Visual Studio 17 2022". With the CLion default, -G Ninja, running the test suite would lead to exceptions. Similarly, LLVM 15.0.7 was the only version with which I could run the full test suite without errors.
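Since the IDE can pick the generator silently, it may be worth checking which one an existing build directory was configured with. CMake records this in the CMAKE_GENERATOR entry of CMakeCache.txt; the function name cmake_generator below is hypothetical, and the sketch only reads that one entry.

```python
from pathlib import Path


def cmake_generator(build_dir):
    """Return the generator recorded in a build directory's CMakeCache.txt."""
    cache = Path(build_dir) / "CMakeCache.txt"
    if not cache.is_file():
        return None
    text = cache.read_text(encoding="utf-8", errors="replace")
    for line in text.splitlines():
        # Cache entries have the form NAME:TYPE=VALUE. The trailing colon
        # avoids matching CMAKE_GENERATOR_PLATFORM and similar entries.
        if line.startswith("CMAKE_GENERATOR:"):
            return line.split("=", 1)[1].strip()
    return None
```

For example, cmake_generator("cmake-build-release") on the build directory from the report above would be expected to return Ninja for the failing configuration and Visual Studio 17 2022 for the working one.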

When using Ninja, the exception would consistently occur in test11_reinterpret of test_arithmetic.py. Also, I found that with the LLVM bundled with MSVC, I would consistently get JIT session error: Failed to materialize symbols: { (main, { __xmm@01010101010101010101010101010101 }) }.

@njroussel
Member

Hi @Microno95

That sounds like a plausible explanation. We actually only run tests in an environment that compiles with MSVC 17 as the generator. In fact, for Mitsuba it's explicitly recommended.

I might be looking at the wrong thing, but it also seems that the GitHub runners we use to produce the pip-installable wheels build with MSVC 17.

As for the stack trace itself, I don't see how/why that could happen. Maybe there's some UB in the cleanup order due to exceptions, but the global locks should be released & available.

@Microno95
Contributor Author

I see, okay. I will stick to the MSVC 17 generators from now on just to be safe. Although I could not, for the life of me, get Mitsuba to compile correctly with MSVC 17, only MSVC 16...

Yeah, the stack trace doesn't really make sense given that the lock should have been released by that point, but, as you said, the cleanup order may be UB.
