Tests of LAGraph are very slow when disabling factory kernels of GraphBLAS on Windows #508

Closed
mmuetzel opened this issue Nov 7, 2023 · 10 comments

@mmuetzel
Contributor

mmuetzel commented Nov 7, 2023

Describe the bug
When GraphBLAS is configured with -DCOMPACT=ON on Windows, the LAGraph tests take considerably longer (by a factor of roughly 5-10).

To Reproduce
Configure and build GraphBLAS with -DCOMPACT=ON. Then, run ctest . for LAGraph.

Expected behavior
Maybe I'm misunderstanding what should have happened, but I expected the JIT compiler to kick in, which would result in only a slight increase in run duration.

Desktop (please complete the following information):

  • OS: Windows (MSYS2)
  • compiler: GCC or Clang (in CI)
  • BLAS and LAPACK library: OpenBLAS
  • Version: current head of the dev2 branch
@DrTimothyAldenDavis
Owner

This is expected, particularly on Windows. With compact on and the JIT enabled, nearly all calls to GraphBLAS require a JIT kernel to be compiled.

The JIT kernels are normally compiled once, then cached in .SuiteSparse/GrB8.2.1 on Linux/Mac or LOCALAPPDATA/SuiteSparse/GrB8.2.1 on Windows, and then reused in the future. But in GitHub they must be compiled fresh each time.

Compilation takes time. On Windows I create a cmake script to do that. On Linux and the Mac I can use cmake or a direct call to the compiler. Cmake is another overhead.

If the JIT is disabled and compact is on, the kernels I use are "generic", doing all ops like plus, times, etc thru function pointers and memcpy instead of direct assignments like A[k]=x.
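As a rough illustration of that difference, here is a minimal sketch (not the GraphBLAS source) of the same elementwise add written both ways. All names here (binop_fn, plus_fp64, ewise_generic, ewise_plus_fp64) and the 64-byte scratch limit are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical operator signature: z = op (x, y), all passed as untyped pointers.
typedef void (*binop_fn) (void *z, const void *x, const void *y) ;

// One operator instance: z = x + y for double, reached only via a pointer.
void plus_fp64 (void *z, const void *x, const void *y)
{
    double a, b, c ;
    memcpy (&a, x, sizeof (double)) ;
    memcpy (&b, y, sizeof (double)) ;
    c = a + b ;
    memcpy (z, &c, sizeof (double)) ;
}

// Generic style: one kernel for any type and operator. Every entry costs
// an indirect call plus memcpy traffic, since the type is unknown here.
void ewise_generic (void *C, const void *A, const void *B,
    size_t n, size_t size, binop_fn op)
{
    unsigned char x [64], y [64], z [64] ;
    assert (size <= sizeof (x)) ;   // sketch only handles small types
    for (size_t k = 0 ; k < n ; k++)
    {
        memcpy (x, (const unsigned char *) A + k * size, size) ;
        memcpy (y, (const unsigned char *) B + k * size, size) ;
        op (z, x, y) ;
        memcpy ((unsigned char *) C + k * size, z, size) ;
    }
}

// Specialized style: type and operator are fixed at compile time, so the
// loop body is a direct assignment the compiler can optimize.
void ewise_plus_fp64 (double *C, const double *A, const double *B, size_t n)
{
    for (size_t k = 0 ; k < n ; k++)
    {
        C [k] = A [k] + B [k] ;
    }
}
```

Both functions compute the same result; the generic one trades speed for not needing a separately compiled kernel per type and operator.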

Generic kernels are much slower than FactoryKernels (the latter are the builtin kernels that are turned on or off with compact mode). But they do not need to be compiled each time. Compilation of a JIT kernel is much slower than a single call to a generic kernel. I have to shell out to several calls to system("cmake here...").

So there are four kinds of kernels:

1. FactoryKernels, which are very fast but mean that building the initial GraphBLAS library takes a while.
2. JIT kernels, compiled at run time, which are just as fast but very specialized and take time to compile.
3. PreJIT kernels, which were once JIT kernels but then incorporated into the GraphBLAS library (none are used for the LAGraph tests).
4. Generic kernels, which are slow, but there are only 44 of them in GraphBLAS so they are fast to compile.

Also, the LAGraph tests use small matrices/graphs. On such tiny problems, the difference in performance between the JIT and generic kernels is much smaller.

I'll check out how many JIT kernels are compiled if compact is on and the jit is enabled, to do all the LAGraph tests. I think it's easily 1000s. That's a lot to compile. With compact off, it's a dozen or two. Some JIT kernels are 1000s of lines long.

So the fastest way to compile AND test both GraphBLAS and LAGraph is compact on and JIT disabled. But in that case the libraries are slow for big problems.

@mmuetzel
Contributor Author

mmuetzel commented Nov 7, 2023

Thanks for clarifying. I was worried that something might be completely wrong on Windows.
I've recently merged a PR for MSYS2 that deactivates the factory kernels in the versions of GraphBLAS that they distribute. (Mainly to reduce library size.) I hope that wasn't a bad idea...

@DrTimothyAldenDavis
Owner

If the JIT is enabled in production, and the FactoryKernels are disabled, then the first-time use of GraphBLAS can be slow. But an end application will tend to reuse the same kernels, and it will be fast the 2nd time. I cache all of the compiled JIT kernels in ~/.SuiteSparse/GrB8.2.1 or LOCALAPPDATA/SuiteSparse/GrB8.2.1 on Windows.

So even if the user restarts the end application, or reboots, when GraphBLAS starts up, it looks in there for any cached kernels, as well as my source code (in the src subfolder). If a JIT kernel is needed, but not loaded, I look in the cache and load it if I find it there. If not found, I compile it first (which is slow) and then load it. The load time is pretty fast, but only happens once when the application is running.

This should be OK. It would be faster when using GraphBLAS the first time to enable the FactoryKernels, but this is not a serious issue if an end application is used over and over.

The reason the ctests on GitHub are slow is that LOCALAPPDATA/SuiteSparse/GrB8.2.1 is not saved. So it's the first time every time. It shouldn't be saved anyway, since a small tweak to any GraphBLAS source code can invalidate the cached kernels. For end users, I protect against that issue by the name "GrB8.2.1", so the next version, even 8.2.2, will not use JIT kernels compiled by 8.2.1.
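The version-in-the-folder-name idea can be sketched as follows. This is only an illustration of the naming scheme described in this thread, not the code GraphBLAS uses; the function jit_cache_path and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: build a per-version cache folder name under `base`.
// `hidden_dir` selects ".SuiteSparse" (the Linux/Mac layout mentioned above)
// versus "SuiteSparse" (the layout under LOCALAPPDATA on Windows).
// Returns the snprintf result: the length that was needed, or negative on
// an encoding error.
int jit_cache_path (char *buf, size_t len, const char *base,
    bool hidden_dir, int major, int minor, int patch)
{
    assert (buf != NULL && base != NULL) ;
    return (snprintf (buf, len, "%s/%sSuiteSparse/GrB%d.%d.%d",
        base, hidden_dir ? "." : "", major, minor, patch)) ;
}
```

Because the full version is part of the path, a library built as 8.2.2 looks in a different folder and never picks up kernels compiled by 8.2.1.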

I just ran the LAGraph tests with GraphBLAS compiled in COMPACT mode ON with NJIT off, so the FactoryKernels are disabled and JIT enabled. The LAGraph ctests create 725 kernels. With COMPACT off, only 66 kernels are compiled (I have some user-defined types/operators, and some typecasting, in the LAGraph tests, and all of those go outside the realm of FactoryKernels).

Compiling those kernels seems to be faster on Linux and the Mac, because I can go directly to the compiler command "/usr/bin/gcc ..." say. I wasn't able to get that to work on Windows. I have two ways to compile a JIT kernel: direct calls to the compiler and linker, or a cmake script I create just for that kernel, and Windows has to use the latter. That requires more overhead to start up cmake ... for each and every kernel. Or, it might take longer because the system ("cmake -- etc") call itself is slower on Windows than on Linux/Mac.

I have another method to reduce the size of the library: the flags in GraphBLAS/Source/GB_control.h. Those can be uncommented, or added as compile flags, like -DGxB_NO_INT16. I can (for example) turn off all int64 and uint16 FactoryKernels. They become stubs that return a code that says "I didn't do anything" which I then use to punt to the JIT (if enabled) or generic (always present).
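A minimal sketch of that stub-and-punt pattern, under stated assumptions: DEMO_NO_INT16 is a hypothetical switch standing in for a flag like -DGxB_NO_INT16, and all demo_* functions are invented for the example rather than taken from GraphBLAS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical compile-time switch, standing in for a flag like -DGxB_NO_INT16.
#define DEMO_NO_INT16

// Fast, type-specific kernel. When switched off it compiles to a stub that
// does no work and reports that fact to the caller.
bool demo_factory_plus_int16 (int16_t *C, const int16_t *A,
    const int16_t *B, size_t n)
{
#ifdef DEMO_NO_INT16
    (void) C ; (void) A ; (void) B ; (void) n ;
    return (false) ;            // "I didn't do anything"
#else
    for (size_t k = 0 ; k < n ; k++)
    {
        C [k] = (int16_t) (A [k] + B [k]) ;
    }
    return (true) ;
#endif
}

// Fallback that is always present (plays the role of the JIT or generic path).
void demo_fallback_plus_int16 (int16_t *C, const int16_t *A,
    const int16_t *B, size_t n)
{
    for (size_t k = 0 ; k < n ; k++)
    {
        C [k] = (int16_t) (A [k] + B [k]) ;
    }
}

// Dispatcher: try the factory kernel first and punt if it was a stub.
// Returns true if the factory kernel did the work.
bool demo_plus_int16 (int16_t *C, const int16_t *A, const int16_t *B, size_t n)
{
    assert (C != NULL && A != NULL && B != NULL) ;
    if (demo_factory_plus_int16 (C, A, B, n))
    {
        return (true) ;
    }
    demo_fallback_plus_int16 (C, A, B, n) ;
    return (false) ;
}
```

The caller gets the same answer either way; the switch only changes which path computes it and how much code ends up in the library.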

Prior to the JIT, the FactoryKernels were the only way to get good performance. Now that the JIT is in place, I can (eventually) get good performance without any FactoryKernels, once GraphBLAS is "warmed up" enough. So this means I could trim the # of FactoryKernels I actually include when COMPACT mode is off.

@mmuetzel
Contributor Author

mmuetzel commented Nov 7, 2023

That sounds good to me.
With respect to performance on Windows:
Spawning processes has a pretty large overhead. So, that might be rather slow compared to Linux. But that's only a "one-time cost" when a new kernel is required the first time if I understood correctly.
I don't know how you check if a kernel is already built. In case you do that with stat: That's also pretty slow on Windows. (That might be a "repeated cost".)

@DrTimothyAldenDavis
Owner

I use LoadLibrary to try to load an existing library. If that fails, I create and compile the JIT kernel, and then try to load it again (if it fails then it's likely some kind of compiler error). This only has to be done once each time an application starts up and uses GraphBLAS. Once it's loaded in, it stays there until the application exits.

See

HINSTANCE hdll = LoadLibrary (library_name) ;

That's called by the code inside GB_jitifyer.c.
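The load, compile, load-again flow can be sketched like this. It is an illustration only, not the code in GB_jitifyer.c: the loader and compiler are passed in as hypothetical callbacks (on Windows the loader would wrap LoadLibrary, on Linux/Mac dlopen), and the demo_* functions simulate a cache in memory so the sketch runs without touching the OS.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical callback types: a loader returns NULL on failure, and the
// compile step reports no status of its own.
typedef void *(*load_fn) (const char *library_name) ;
typedef void (*compile_fn) (const char *library_name) ;

// Try to load; if that fails, compile once and try again. A NULL result
// after the second attempt most likely means the compilation failed.
void *load_or_compile (const char *library_name, load_fn load,
    compile_fn compile)
{
    assert (library_name != NULL && load != NULL && compile != NULL) ;
    void *handle = load (library_name) ;
    if (handle != NULL)
    {
        return (handle) ;       // found in the cache: the fast path
    }
    compile (library_name) ;    // slow path, taken at most once per kernel
    return (load (library_name)) ;
}

// In-memory stand-ins for the real loader and compiler (demo only).
static int demo_is_compiled = 0 ;
static int demo_compile_count = 0 ;

void *demo_load (const char *library_name)
{
    (void) library_name ;
    return (demo_is_compiled ? (void *) &demo_is_compiled : NULL) ;
}

void demo_compile (const char *library_name)
{
    (void) library_name ;
    demo_is_compiled = 1 ;
    demo_compile_count++ ;
}
```

Calling load_or_compile twice with the demo callbacks compiles only on the first call, which mirrors the one-time cost discussed in this thread.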

@DrTimothyAldenDavis
Owner

Once a JIT kernel is loaded into memory (LoadLibrary on Windows, or dlopen on Linux/Mac), I keep the function pointer to the JIT kernel in a hash table. Then when the GrB method gets called again, I encode all the variations it could be, compute its hash, and grab it from the hash table. No calls to the OS in that case, at all.
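A minimal sketch of such a lookup table, assuming a fixed-size open-addressing layout and an FNV-1a hash of a text encoding. Both choices, and every name below, are assumptions for illustration; the actual table and encoding in GraphBLAS may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*kernel_fn) (void) ;  // placeholder kernel signature

#define TABLE_SIZE 64               // must be a power of two

typedef struct { uint64_t hash ; kernel_fn fn ; } table_entry ;
typedef struct { table_entry slots [TABLE_SIZE] ; size_t count ; } kernel_table ;

// FNV-1a over a string that encodes the kernel variant (types, operator...).
uint64_t encode_hash (const char *encoding)
{
    uint64_t h = 0xcbf29ce484222325ULL ;
    for (const unsigned char *p = (const unsigned char *) encoding ; *p ; p++)
    {
        h ^= *p ;
        h *= 0x100000001b3ULL ;
    }
    return (h) ;
}

// Insert with linear probing; a slot with fn == NULL is empty. A real table
// would also compare the full encoding to guard against hash collisions.
bool table_insert (kernel_table *t, uint64_t hash, kernel_fn fn)
{
    assert (t != NULL && fn != NULL) ;
    if (t->count >= TABLE_SIZE - 1) return (false) ;    // keep one slot empty
    size_t i = (size_t) (hash & (TABLE_SIZE - 1)) ;
    while (t->slots [i].fn != NULL)
    {
        if (t->slots [i].hash == hash) { t->slots [i].fn = fn ; return (true) ; }
        i = (i + 1) & (TABLE_SIZE - 1) ;
    }
    t->slots [i].hash = hash ;
    t->slots [i].fn = fn ;
    t->count++ ;
    return (true) ;
}

// Lookup touches only memory: no file system or loader calls are involved.
kernel_fn table_lookup (const kernel_table *t, uint64_t hash)
{
    size_t i = (size_t) (hash & (TABLE_SIZE - 1)) ;
    while (t->slots [i].fn != NULL)
    {
        if (t->slots [i].hash == hash) return (t->slots [i].fn) ;
        i = (i + 1) & (TABLE_SIZE - 1) ;
    }
    return (NULL) ;
}

void demo_kernel (void) { }         // stand-in for a loaded JIT kernel
```

A NULL from table_lookup would be the signal to fall back to loading (or compiling) the kernel and then inserting its pointer.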

I have a short video on it here: https://youtu.be/N0cUUBGfzTo?si=AoNYsemKoKBEqldd

@mmuetzel
Contributor Author

mmuetzel commented Nov 8, 2023

This seems to be working quite nicely also on Windows (MinGW).
I configured (using the root CMakeLists.txt) with

cmake -DCMAKE_INSTALL_PREFIX=.. -DCMAKE_BUILD_TYPE=Release -DBLA_VENDOR=OpenBLAS -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_Fortran_COMPILER_LAUNCHER=ccache -DPython_EXECUTABLE=$(which python) -DCOMPACT=ON ..

After building the libraries, I ran ctest . twice with the following timing:

  • first run: Total Test time (real) = 647.04 sec
  • second run: Total Test time (real) = 19.04 sec

So, this does indeed seem to be a one-time cost only.
This one-time overhead is pretty large here though. That might be because all JIT kernels seem to be compiled using cmake, which seems to spawn several gcc processes before actually compiling the kernel. Is there something I could do to help speed that up (like setting some environment variables or changing configuration files)?

PS: The JIT compiler cache is at ~/.SuiteSparse/GrB8.2.1 for me despite being on Windows. That's probably because I'm using a shell that sets HOME.

@DrTimothyAldenDavis
Owner

I have two methods in GB_jitifyer.c for compiling kernels: use direct calls to the compiler, or use a cmake script created just for that kernel. See the variable there called GB_jit_use_cmake. The user has control over that, except that on Windows I force this variable to always be true.

It's just a matter of porting the GB_jitifyer_direct_compile function to Windows. I couldn't figure out how to do that. The Linux/Mac case is done with a single system("cc ... ; ld ...") call.

static bool GB_jit_use_cmake =
    #if GB_WINDOWS
    true ;      // Windows requires cmake
    #else
    false ;     // otherwise, default is to skip cmake and compile directly
    #endif

// compile the kernel to get the lib*.so file
if (GB_jit_use_cmake)
{
    // use cmake to compile the kernel
    GB_jitifyer_cmake_compile (kernel_name, hash) ;
}
else
{
    // use the compiler to directly compile the kernel
    GB_jitifyer_direct_compile (kernel_name, bucket) ;
}

The direct compile method only works on Linux / Mac. I couldn't get this to work on Windows:

//------------------------------------------------------------------------------
// GB_jitifyer_direct_compile: compile a kernel with just the compiler
//------------------------------------------------------------------------------
// This method does not return any error/success code. If the compilation
// fails for any reason, the subsequent load of the compiled kernel will fail.
// This method does not work on Windows.
void GB_jitifyer_direct_compile (char *kernel_name, uint32_t bucket)
{
#ifndef NJIT
    char *burble_stdout = GB_Global_burble_get ( ) ? "" : GB_DEV_NULL ;
    char *err_redirect = (strlen (GB_jit_error_log) > 0) ? " 2>> " : "" ;
    snprintf (GB_jit_temp, GB_jit_temp_allocated,
        // compile:
        "%s -DGB_JIT_RUNTIME=1 "        // compiler command
        "%s "                           // C flags
        "-I%s/src "                     // include source directory
        "%s "                           // openmp include directories
        "-o %s/c/%02x/%s%s "            // *.o output file
        "-c %s/c/%02x/%s.c "            // *.c input file
        "%s "                           // burble stdout
        "%s %s ; "                      // error log file
        // link:
        "%s "                           // C compiler
        "%s "                           // C flags
        "%s "                           // C link flags
        "-o %s/lib/%02x/%s%s%s "        // lib*.so output file
        "%s/c/%02x/%s%s "               // *.o input file
        "%s "                           // libraries to link with
        "%s"                            // burble stdout
        "%s %s ",                       // error log file
        // compile:
        GB_jit_C_compiler,              // C compiler
        GB_jit_C_flags,                 // C flags
        GB_jit_cache_path,              // include source directory (cache/src)
        GB_OMP_INC,                     // openmp include
        GB_jit_cache_path, bucket, kernel_name, GB_OBJ_SUFFIX, // *.o output file
        GB_jit_cache_path, bucket, kernel_name, // *.c input file
        burble_stdout,                  // burble stdout
        err_redirect, GB_jit_error_log, // error log file
        // link:
        GB_jit_C_compiler,              // C compiler
        GB_jit_C_flags,                 // C flags
        GB_jit_C_link_flags,            // C link flags
        GB_jit_cache_path, bucket,
        GB_LIB_PREFIX, kernel_name, GB_LIB_SUFFIX, // lib*.so file
        GB_jit_cache_path, bucket, kernel_name, GB_OBJ_SUFFIX, // *.o input file
        GB_jit_C_libraries,             // libraries to link with
        burble_stdout,                  // burble stdout
        err_redirect, GB_jit_error_log) ; // error log file
    // compile the library and return result
    GBURBLE ("(jit: %s) ", GB_jit_temp) ;
    GB_jitifyer_command (GB_jit_temp) ;
    // remove the *.o file
    snprintf (GB_jit_temp, GB_jit_temp_allocated, "%s/c/%02x/%s%s",
        GB_jit_cache_path, bucket, kernel_name, GB_OBJ_SUFFIX) ;
    remove (GB_jit_temp) ;
#endif
}

@mmuetzel
Contributor Author

mmuetzel commented Nov 10, 2023

Thanks for these pointers. I opened #513 with changes that let me use the JIT compiler without CMake on Windows using MinGW compilers (from MSYS2).

@mmuetzel
Contributor Author

I believe this is solved now. (At least for Windows using MinGW.)
And it's clear why using the JIT compiler will be slow in the CI.

Thank you again for the explanations and pointers.
