Tests of LAGraph are very slow when disabling factory kernels of GraphBLAS on Windows #508

mmuetzel · 2023-11-07T09:02:52Z

Describe the bug
When configuring GraphBLAS with -DCOMPACT=ON on Windows, the tests of LAGraph take considerably longer (by a factor of approximately 5-10).

To Reproduce
Configure and build GraphBLAS with -DCOMPACT=ON. Then, run ctest . for LAGraph.

Expected behavior
Maybe I'm misunderstanding what should happen, but I expected the JIT compiler to kick in, which would result in only a slight increase in run duration.

Desktop (please complete the following information):

OS: Windows (MSYS2)
compiler: GCC or Clang (in CI)
BLAS and LAPACK library: OpenBLAS
Version: current head of the dev2 branch


DrTimothyAldenDavis · 2023-11-07T13:29:38Z

This is expected, particularly on Windows. With compact on and the JIT enabled, nearly all calls to GraphBLAS require a JIT kernel to be compiled.

The JIT kernels are normally compiled once, then cached in ~/.SuiteSparse/GrB8.2.1 on Linux/Mac or LOCALAPPDATA/SuiteSparse/GrB8.2.1 on Windows, and then reused in the future. But in GitHub they must be compiled fresh each time.

Compilation takes time. On Windows I create a cmake script to do that. On Linux and the Mac I can use cmake or a direct call to the compiler. Using cmake adds further overhead.

If the JIT is disabled and compact is on, the kernels I use are "generic", doing all ops like plus, times, etc. through function pointers and memcpy instead of direct assignments like A[k]=x.

Generic kernels are much slower than FactoryKernels (the latter are the builtin kernels that are turned on or off with compact mode). But they do not need to be compiled each time. Compilation of a JIT kernel is much slower than a single call to a generic kernel. I have to shell out to several calls to system("cmake here...").
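As a rough illustration of the difference between a generic kernel and a type-specialized one (this is not the actual GraphBLAS code; `binop_fn`, `plus_fp64`, `ewise_generic`, and `ewise_plus_fp64` are hypothetical names invented for this sketch), the generic version below applies the operator through a function pointer and stores the result with memcpy, while the specialized version uses a direct assignment:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical operator signature: z = f (x, y), all passed as untyped pointers.
typedef void (*binop_fn) (void *z, const void *x, const void *y) ;

// One possible operator: double-precision plus.
void plus_fp64 (void *z, const void *x, const void *y)
{
    *(double *) z = *(const double *) x + *(const double *) y ;
}

// Generic style: works for any type of the given size and any operator, but
// every entry costs an indirect function call plus a memcpy.
void ewise_generic (void *C, const void *A, const void *B, size_t n,
    size_t size, binop_fn f)
{
    assert (size <= 64) ;                       // scratch space limit of this demo
    _Alignas (max_align_t) unsigned char z [64] ;
    const unsigned char *Ax = A ;
    const unsigned char *Bx = B ;
    unsigned char *Cx = C ;
    for (size_t k = 0 ; k < n ; k++)
    {
        f (z, Ax + k*size, Bx + k*size) ;       // operator via function pointer
        memcpy (Cx + k*size, z, size) ;         // "assignment" via memcpy
    }
}

// Specialized style: the type and operator are known at compile time, so the
// compiler can inline and vectorize the loop.
void ewise_plus_fp64 (double *C, const double *A, const double *B, size_t n)
{
    for (size_t k = 0 ; k < n ; k++)
    {
        C [k] = A [k] + B [k] ;                 // direct assignment
    }
}
```

Both functions compute the same result; the specialized one simply gives the compiler far more to work with, which matters mostly on large problems.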

So there are four kinds of kernels: 1. FactoryKernels, which are very fast but mean that building the initial GraphBLAS library takes a while. 2. JIT kernels compiled at run time, which are just as fast but very specialized and take time to compile. 3. PreJIT kernels, which were once JIT kernels but then incorporated into the GraphBLAS library (none used for the LAGraph tests). 4. Generic kernels, which are slow, but only 44 of them are in GraphBLAS so they are fast to compile.

Also, the LAGraph tests use small matrices/graphs. On such tiny problems, the difference in performance between the JIT and generic kernels is much smaller.

I'll check out how many JIT kernels are compiled if compact is on and the jit is enabled, to do all the LAGraph tests. I think it's easily 1000s. That's a lot to compile. With compact off, it's a dozen or two. Some JIT kernels are 1000s of lines long.

So the fastest way to compile AND test both GraphBLAS and LAGraph is compact on and JIT disabled. But in that case the libraries are slow for big problems.

mmuetzel · 2023-11-07T15:12:47Z

Thanks for clarifying. I was worried that something might be completely wrong on Windows.
I've recently merged a PR for MSYS2 that deactivates the factory kernels in the versions of GraphBLAS that they distribute. (Mainly to reduce library size.) I hope that wasn't a bad idea...

DrTimothyAldenDavis · 2023-11-07T16:23:06Z

If the JIT is enabled in production, and the FactoryKernels are disabled, then the first-time use of GraphBLAS can be slow. But an end application will tend to reuse the same kernels, and it will be fast the 2nd time. I cache all of the compiled JIT kernels in ~/.SuiteSparse/GrB8.2.1 on Linux/Mac or LOCALAPPDATA/SuiteSparse/GrB8.2.1 on Windows.

So even if the user restarts the end application, or reboots, when GraphBLAS starts up, it looks in there for any cached kernels, as well as my source code (in the src subfolder). If a JIT kernel is needed, but not loaded, I look in the cache and load it if I find it there. If not found, I compile it first (which is slow) and then load it. The load time is pretty fast, but only happens once when the application is running.

This should be OK. It would be faster, when using GraphBLAS the first time, to enable the FactoryKernels, but this is not a serious issue if an end application is used over and over.

The reason the ctests on GitHub are slow is that LOCALAPPDATA/SuiteSparse/GrB8.2.1 is not saved. So it's the first time every time. It shouldn't be saved anyway, since a small tweak to any GraphBLAS source code can invalidate the cached kernels. For end users, I protect against that issue by the name "GrB8.2.1", so the next version, even 8.2.2, will not use JIT kernels compiled by 8.2.1.
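A minimal sketch of how a version-stamped cache folder name could be built, so that kernels from one release are never picked up by another. This is not the GraphBLAS implementation: `jit_cache_path` is a hypothetical helper, and the choice to prefer HOME over LOCALAPPDATA is an assumption made only for this example. A caller would typically pass in the results of getenv ("HOME") and getenv ("LOCALAPPDATA").

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: write the cache folder for a given library version
// into path.  home and localappdata are the values of the corresponding
// environment variables (either may be NULL).  Returns the length of the
// path, or -1 if neither variable is set or the buffer is too small.
int jit_cache_path (char *path, size_t len, const char *home,
    const char *localappdata, int major, int minor, int patch)
{
    assert (path != NULL && len > 0) ;
    int n ;
    if (home != NULL)
    {
        // assumption of this sketch: HOME wins if it is set
        n = snprintf (path, len, "%s/.SuiteSparse/GrB%d.%d.%d",
            home, major, minor, patch) ;
    }
    else if (localappdata != NULL)
    {
        n = snprintf (path, len, "%s/SuiteSparse/GrB%d.%d.%d",
            localappdata, major, minor, patch) ;
    }
    else
    {
        return (-1) ;
    }
    // snprintf reports the length it wanted; treat truncation as failure
    return ((n < 0 || (size_t) n >= len) ? -1 : n) ;
}
```

Because the full version is part of the folder name, versions 8.2.1 and 8.2.2 map to different folders and cannot share compiled kernels.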

I just ran the LAGraph tests with GraphBLAS compiled in COMPACT mode ON with NJIT off, so the FactoryKernels are disabled and JIT enabled. The LAGraph ctests create 725 kernels. With COMPACT off, only 66 kernels are compiled (I have some user-defined types/operators, and some typecasting, in the LAGraph tests, and all of those go outside the realm of FactoryKernels).

Compiling those kernels seems to be faster on Linux and the Mac, because I can go directly to the compiler command "/usr/bin/gcc ..." say. I wasn't able to get that to work on Windows. I have two ways to compile a JIT kernel: direct calls to the compiler and linker, or a cmake script I create just for that kernel. The latter requires more overhead to start up cmake ... for each and every kernel. Or, it might take longer because the system ("cmake -- etc") call takes longer on Windows than on Linux/Mac.

I have another method to reduce the size of the library: the flags in GraphBLAS/Source/GB_control.h. Those can be uncommented, or added as compile flags, like -DGxB_NO_INT16. I can (for example) turn off all int64 and uint16 FactoryKernels. They become stubs that return a code that says "I didn't do anything" which I then use to punt to the JIT (if enabled) or generic (always present).
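The stub-and-punt pattern described above might look roughly like the following. This is a sketch under stated assumptions, not the GraphBLAS source: `DEMO_NO_INT16` is a hypothetical stand-in for a flag such as -DGxB_NO_INT16, and the status codes, kernel functions, and `add_int16` dispatcher are all invented for illustration. The "JIT" kernel here is an ordinary function standing in for a kernel that would be compiled at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical compile-time switch: when nonzero, the int16 factory kernel
// is compiled as a stub.
#ifndef DEMO_NO_INT16
#define DEMO_NO_INT16 1
#endif

typedef enum { KERNEL_DONE, KERNEL_NOT_HANDLED } kernel_status ;
typedef enum { USED_FACTORY, USED_JIT, USED_GENERIC } kernel_kind ;

// Shared loop so the demo kernels all produce the same answer.
static void add_loop (int16_t *C, const int16_t *A, const int16_t *B, size_t n)
{
    for (size_t k = 0 ; k < n ; k++) C [k] = (int16_t) (A [k] + B [k]) ;
}

kernel_status factory_plus_int16 (int16_t *C, const int16_t *A,
    const int16_t *B, size_t n)
{
#if DEMO_NO_INT16
    // stub: report "I didn't do anything" so the caller can punt
    (void) C ; (void) A ; (void) B ; (void) n ;
    return (KERNEL_NOT_HANDLED) ;
#else
    add_loop (C, A, B, n) ;
    return (KERNEL_DONE) ;
#endif
}

// Stand-in for a JIT kernel (a real one would be compiled and loaded).
kernel_status jit_plus_int16 (int16_t *C, const int16_t *A,
    const int16_t *B, size_t n)
{
    add_loop (C, A, B, n) ;
    return (KERNEL_DONE) ;
}

// Stand-in for the always-present generic kernel.
void generic_plus_int16 (int16_t *C, const int16_t *A,
    const int16_t *B, size_t n)
{
    add_loop (C, A, B, n) ;
}

// Try the factory kernel, then the JIT (if enabled), then the generic kernel.
kernel_kind add_int16 (int16_t *C, const int16_t *A, const int16_t *B,
    size_t n, bool jit_enabled)
{
    if (factory_plus_int16 (C, A, B, n) == KERNEL_DONE) return (USED_FACTORY) ;
    if (jit_enabled && jit_plus_int16 (C, A, B, n) == KERNEL_DONE)
    {
        return (USED_JIT) ;
    }
    generic_plus_int16 (C, A, B, n) ;
    return (USED_GENERIC) ;
}
```

With the stub enabled (the default in this sketch), the dispatcher falls through to the JIT stand-in when jit_enabled is true and to the generic kernel otherwise.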

Prior to the JIT, the FactoryKernels were the only way to get good performance. Now that the JIT is in place, I can (eventually) get good performance without any FactoryKernels, once GraphBLAS is "warmed up" enough. So this means I could trim the # of FactoryKernels I actually include when COMPACT mode is off.

mmuetzel · 2023-11-07T16:42:37Z

That sounds good to me.
With respect to performance on Windows:
Spawning processes has a pretty large overhead. So, that might be rather slow compared to Linux. But that's only a "one-time cost" when a new kernel is required the first time if I understood correctly.
I don't know how you check if a kernel is already built. In case you do that with stat: That's also pretty slow on Windows. (That might be a "repeated cost".)

DrTimothyAldenDavis · 2023-11-08T00:27:54Z

I use LoadLibrary to try to load an existing library. If that fails, I create and compile the JIT kernel, and then try to load it again (if it fails then it's likely some kind of compiler error). This only has to be done once each time an application starts up and uses GraphBLAS. Once it's loaded in, it stays there until the application exits.

See SuiteSparse/GraphBLAS/Source/GB_file.c, line 313 in 8627b9c:

    HINSTANCE hdll = LoadLibrary (library_name) ;

That's called by the code inside GB_jitifyer.c.
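The load, compile-on-failure, then load-again sequence described above can be sketched as follows. This is only an illustration of the control flow, not the code in GB_jitifyer.c: `load_or_compile` and its callback types are hypothetical, and the `fake_load` / `fake_compile` functions are stand-ins for a real loader (LoadLibrary or dlopen) and a real compiler invocation.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical callback types: a loader returns a library handle or NULL,
// and a compiler builds the library but reports no status of its own.
typedef void *(*load_fn) (const char *library_name) ;
typedef void (*compile_fn) (const char *library_name) ;

// Try to load an existing library; if that fails, compile it and try again.
// A NULL result after the second attempt most likely means the compilation
// itself failed.
void *load_or_compile (const char *library_name, load_fn load,
    compile_fn compile)
{
    assert (library_name != NULL && load != NULL && compile != NULL) ;
    void *handle = load (library_name) ;
    if (handle != NULL) return (handle) ;   // already in the cache: fast path
    compile (library_name) ;                // slow path, first use only
    return (load (library_name)) ;
}

// Stand-ins used to exercise the control flow without a real compiler.
int fake_built = 0 ;        // nonzero once the "library" exists
int fake_compiles = 0 ;     // number of times the "compiler" was run

void *fake_load (const char *library_name)
{
    (void) library_name ;
    return (fake_built ? (void *) &fake_built : NULL) ;
}

void fake_compile (const char *library_name)
{
    (void) library_name ;
    fake_built = 1 ;
    fake_compiles++ ;
}
```

Calling load_or_compile twice with the stand-ins runs the "compiler" only on the first call, which mirrors the one-time cost discussed in this thread.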

DrTimothyAldenDavis · 2023-11-08T00:32:50Z

Once a JIT kernel is loaded into memory (LoadLibrary on Windows, or dlopen on Linux/Mac), I keep the function pointer to the JIT kernel in a hash table. Then when the GrB method gets called again, I encode all the variations it could be, compute its hash, and grab it from the hash table. No calls to the OS in that case, at all.

I have a short video on it here: https://youtu.be/N0cUUBGfzTo?si=AoNYsemKoKBEqldd
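For readers who want a concrete picture of the in-memory lookup, here is a small sketch of a hash table that maps a hashed kernel encoding to a function pointer. Everything in it is an assumption made for illustration: the table layout, the use of the FNV-1a hash, linear probing, and the names `encode_hash`, `kernel_insert`, and `kernel_lookup` are not taken from GraphBLAS. For brevity it keys entries by the hash alone; a real table would also compare the full encoding to guard against collisions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*kernel_fn) (void) ;      // hypothetical kernel signature

#define TABLE_SIZE 64                   // must be a power of two

typedef struct
{
    uint64_t key ;                      // hash of the kernel encoding
    kernel_fn fn ;                      // NULL marks an empty slot
} kernel_entry ;

static kernel_entry kernel_table [TABLE_SIZE] ;

// 64-bit FNV-1a hash over the bytes that encode a kernel variant.
uint64_t encode_hash (const void *encoding, size_t len)
{
    const unsigned char *p = encoding ;
    uint64_t h = 14695981039346656037ULL ;
    for (size_t i = 0 ; i < len ; i++)
    {
        h ^= p [i] ;
        h *= 1099511628211ULL ;
    }
    return (h) ;
}

// Insert with linear probing; returns 1 on success, 0 if the table is full.
int kernel_insert (uint64_t hash, kernel_fn fn)
{
    assert (fn != NULL) ;
    for (size_t i = 0 ; i < TABLE_SIZE ; i++)
    {
        kernel_entry *e = &kernel_table [(hash + i) & (TABLE_SIZE - 1)] ;
        if (e->fn == NULL || e->key == hash)
        {
            e->key = hash ;
            e->fn = fn ;
            return (1) ;
        }
    }
    return (0) ;
}

// Lookup touches only memory: no file system or OS calls are involved.
kernel_fn kernel_lookup (uint64_t hash)
{
    for (size_t i = 0 ; i < TABLE_SIZE ; i++)
    {
        const kernel_entry *e = &kernel_table [(hash + i) & (TABLE_SIZE - 1)] ;
        if (e->fn == NULL) return (NULL) ;      // empty slot: not loaded yet
        if (e->key == hash) return (e->fn) ;
    }
    return (NULL) ;
}

// Demo kernel so the table has something to hold.
int demo_calls = 0 ;
void demo_kernel (void) { demo_calls++ ; }
```

A NULL result from kernel_lookup is the signal to fall back to the slower path of loading the kernel from the cache folder or compiling it.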

mmuetzel · 2023-11-08T10:03:07Z

This seems to be working quite nicely also on Windows (MinGW).
I configured (using the root CMakeLists.txt) with

cmake -DCMAKE_INSTALL_PREFIX=.. -DCMAKE_BUILD_TYPE=Release -DBLA_VENDOR=OpenBLAS -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_Fortran_COMPILER_LAUNCHER=ccache -DPython_EXECUTABLE=$(which python) -DCOMPACT=ON ..

After building the libraries, I ran ctest . twice with the following timing:

first run: Total Test time (real) = 647.04 sec
second run: Total Test time (real) = 19.04 sec

So, this seems to be a one-time cost only indeed.
This one-time overhead is pretty large here though. That might be because all JIT kernels seem to be compiled using cmake. That seems to be spawning several gcc processes before actually compiling the kernel. Is there something I could do to help speed that up (like setting some environment variables or changing configuration files)?

PS: The JIT compiler cache is at ~/.SuiteSparse/GrB8.2.1 for me albeit being on Windows. That's probably because I'm using a shell that sets HOME.

DrTimothyAldenDavis · 2023-11-09T16:26:41Z

I have two methods in GB_jitifyer.c for compiling kernels: use direct calls to the compiler, or use a cmake script created just for that kernel. See the variable there called GB_jit_use_cmake. The user has control over that, except that on Windows I force this variable to always be true.

It's just a matter of porting the GB_jitifyer_direct_compile function to Windows. I couldn't figure out how to do that. The Linux/Mac case is done with a single system("cc ... ; ld ...") call.

SuiteSparse/GraphBLAS/Source/GB_jitifyer.c, lines 39 to 44 in 8627b9c:

    static bool GB_jit_use_cmake =
        #if GB_WINDOWS
        true ;      // Windows requires cmake
        #else
        false ;     // otherwise, default is to skip cmake and compile directly
        #endif
SuiteSparse/GraphBLAS/Source/GB_jitifyer.c, lines 1823 to 1833 in 8627b9c:

    // compile the kernel to get the lib*.so file
    if (GB_jit_use_cmake)
    {
        // use cmake to compile the kernel
        GB_jitifyer_cmake_compile (kernel_name, hash) ;
    }
    else
    {
        // use the compiler to directly compile the kernel
        GB_jitifyer_direct_compile (kernel_name, bucket) ;
    }

The direct compile method only works on Linux / Mac. I couldn't get this to work on Windows:

SuiteSparse/GraphBLAS/Source/GB_jitifyer.c, lines 2289 to 2359 in 8627b9c:

    //------------------------------------------------------------------------------
    // GB_jitifyer_direct_compile: compile a kernel with just the compiler
    //------------------------------------------------------------------------------

    // This method does not return any error/success code.  If the compilation
    // fails for any reason, the subsequent load of the compiled kernel will fail.
    // This method does not work on Windows.

    void GB_jitifyer_direct_compile (char *kernel_name, uint32_t bucket)
    {
    #ifndef NJIT
        char *burble_stdout = GB_Global_burble_get ( ) ? "" : GB_DEV_NULL ;
        char *err_redirect = (strlen (GB_jit_error_log) > 0) ? " 2>> " : "" ;

        snprintf (GB_jit_temp, GB_jit_temp_allocated,
        // compile:
        "%s -DGB_JIT_RUNTIME=1 "            // compiler command
        "%s "                               // C flags
        "-I%s/src "                         // include source directory
        "%s "                               // openmp include directories
        "-o %s/c/%02x/%s%s "                // *.o output file
        "-c %s/c/%02x/%s.c "                // *.c input file
        "%s "                               // burble stdout
        "%s %s ; "                          // error log file
        // link:
        "%s "                               // C compiler
        "%s "                               // C flags
        "%s "                               // C link flags
        "-o %s/lib/%02x/%s%s%s "            // lib*.so output file
        "%s/c/%02x/%s%s "                   // *.o input file
        "%s "                               // libraries to link with
        "%s"                                // burble stdout
        "%s %s ",                           // error log file
        // compile:
        GB_jit_C_compiler,                  // C compiler
        GB_jit_C_flags,                     // C flags
        GB_jit_cache_path,                  // include source directory (cache/src)
        GB_OMP_INC,                         // openmp include
        GB_jit_cache_path, bucket, kernel_name, GB_OBJ_SUFFIX,  // *.o output file
        GB_jit_cache_path, bucket, kernel_name,                 // *.c input file
        burble_stdout,                      // burble stdout
        err_redirect, GB_jit_error_log,     // error log file
        // link:
        GB_jit_C_compiler,                  // C compiler
        GB_jit_C_flags,                     // C flags
        GB_jit_C_link_flags,                // C link flags
        GB_jit_cache_path, bucket,
        GB_LIB_PREFIX, kernel_name, GB_LIB_SUFFIX,              // lib*.so file
        GB_jit_cache_path, bucket, kernel_name, GB_OBJ_SUFFIX,  // *.o input file
        GB_jit_C_libraries,                 // libraries to link with
        burble_stdout,                      // burble stdout
        err_redirect, GB_jit_error_log) ;   // error log file

        // compile the library and return result
        GBURBLE ("(jit: %s) ", GB_jit_temp) ;
        GB_jitifyer_command (GB_jit_temp) ;

        // remove the *.o file
        snprintf (GB_jit_temp, GB_jit_temp_allocated, "%s/c/%02x/%s%s",
            GB_jit_cache_path, bucket, kernel_name, GB_OBJ_SUFFIX) ;
        remove (GB_jit_temp) ;
    #endif
    }

mmuetzel · 2023-11-10T17:10:01Z

Thanks for these pointers. I opened #513 with changes that let me use the JIT compiler without CMake on Windows using MinGW compilers (from MSYS2).

mmuetzel · 2023-11-14T19:52:21Z

I believe this is solved now. (At least for Windows using MinGW.)
And it's clear why using the JIT compiler will be slow in the CI.

Thank you again for the explanations and pointers.

mmuetzel assigned DrTimothyAldenDavis Nov 7, 2023

mmuetzel closed this as completed Nov 14, 2023
