Tests of LAGraph are very slow when disabling factory kernels of GraphBLAS on Windows #508
This is expected, particularly on Windows. With COMPACT on and the JIT enabled, nearly all calls to GraphBLAS require a JIT kernel to be compiled. The JIT kernels are normally compiled once, then cached in .SuiteSparse/GrB8.2.1. Compilation takes time. On Windows I create a cmake script to do that; on Linux and the Mac I can use cmake or a direct call to the compiler. Cmake is another overhead: I have to shell out to several calls to system("cmake here..."). If the JIT is disabled and COMPACT is on, the kernels I use are "generic", doing all ops like plus, times, etc. through function pointers and memcpy instead of direct assignments like A[k]=x. Generic kernels are much slower than FactoryKernels (the latter are the built-in kernels that are turned on or off with COMPACT mode), but they do not need to be compiled each time. Compilation of a JIT kernel is much slower than a single call to a generic kernel.

So there are four kinds of kernels:

1. FactoryKernels, which are very fast but mean that building the initial GraphBLAS library takes a while.
2. JIT kernels compiled at run time, which are just as fast but very specialized and take time to compile.
3. PreJIT kernels, which were once JIT kernels but were then incorporated into the GraphBLAS library (none are used for the LAGraph tests).
4. Generic kernels, which are slow, but only 44 of them are in GraphBLAS so they are fast to compile.

Also, the LAGraph tests use small matrices/graphs, and on tiny problems the difference in performance between the JIT and generic kernels is much less. I'll check how many JIT kernels get compiled to do all the LAGraph tests when COMPACT is on and the JIT is enabled; I think it's easily 1000s. That's a lot to compile. With COMPACT off, it's a dozen or two. Some JIT kernels are 1000s of lines long.

So the fastest way to compile AND test both GraphBLAS and LAGraph is COMPACT on and JIT disabled. But in that case the libraries are slow for big problems.
Thanks for clarifying. I was worried that something might be completely wrong on Windows.
If the JIT is enabled in production and the FactoryKernels are disabled, then the first-time use of GraphBLAS can be slow. But an end application will tend to reuse the same kernels, and it will be fast the 2nd time. I cache all of the compiled JIT kernels, so even if the user restarts the end application, or reboots, when GraphBLAS starts up it looks in the cache for any compiled kernels, as well as my source code (in the src subfolder). If a JIT kernel is needed but not loaded, I look in the cache and load it if I find it there. If not found, I compile it first (which is slow) and then load it. The load time is pretty fast, and only happens once while the application is running. This should be OK. It would be faster, when using GraphBLAS the first time, to enable the FactoryKernels, but this is not a serious issue if an end application is used over and over.

That is why the ctests on GitHub are slow. I just ran the LAGraph tests with GraphBLAS compiled with COMPACT mode ON and NJIT off, so the FactoryKernels are disabled and the JIT is enabled. The LAGraph ctests create 725 kernels. With COMPACT off, only 66 kernels are compiled (I have some user-defined types/operators, and some typecasting, in the LAGraph tests, and all of those go outside the realm of FactoryKernels).

Compiling those kernels seems to be faster on Linux and the Mac, because I can go directly to the compiler command, "/usr/bin/gcc ..." say. I wasn't able to get that to work on Windows, so I have two ways to compile a JIT kernel: direct calls to the compiler and linker, or a cmake script I create just for that kernel. The latter requires more overhead to start up cmake for each and every kernel. Or it might simply take longer on Windows for other reasons.

I have another method to reduce the size of the library: the flags in GraphBLAS/Source/GB_control.h. Those can be uncommented, or added as compile flags.

Prior to the JIT, the FactoryKernels were the only way to get good performance. Now that the JIT is in place, I can (eventually) get good performance without any FactoryKernels, once GraphBLAS is "warmed up" enough. So this means I could trim the # of FactoryKernels I actually include when COMPACT mode is off.
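The GB_control.h mechanism mentioned above works roughly like this. This is a sketch of a config fragment; the macro names follow the GxB_NO_* pattern used there, but the exact list should be checked against GraphBLAS/Source/GB_control.h itself.

```c
/* Illustrative fragment in the style of GraphBLAS/Source/GB_control.h:
   uncommenting a line (or passing it as a -D compile flag) removes the
   corresponding set of FactoryKernels from the build, shrinking the
   library at the cost of falling back to JIT or generic kernels. */

// #define GxB_NO_INT8   1     /* drop all int8_t FactoryKernels */
// #define GxB_NO_FP32   1     /* drop all float FactoryKernels  */
// #define GxB_NO_PLUS   1     /* drop the PLUS binary operator  */

/* equivalently, on the compiler command line:
   cc ... -DGxB_NO_INT8=1 -DGxB_NO_PLUS=1 ... */
```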
That sounds good to me.
I use LoadLibrary to try to load an existing library. If that fails, I create and compile the JIT kernel, and then try to load it again (if it still fails, it's likely some kind of compiler error). This only has to be done once each time an application starts up and uses GraphBLAS; once a kernel is loaded in, it stays there until the application exits. See SuiteSparse/GraphBLAS/Source/GB_file.c, line 313 in 8627b9c. That's called by the code inside GB_jitifyer.c.
Once a JIT kernel is loaded into memory (LoadLibrary on Windows, or dlopen on Linux/Mac), I keep the function pointer to the JIT kernel in a hash table. Then when the GrB method gets called again, I encode all the variations it could be, compute its hash, and grab it from the hash table. No calls to the OS in that case at all. I have a short video on it here: https://youtu.be/N0cUUBGfzTo?si=AoNYsemKoKBEqldd
This seems to be working quite nicely also on Windows (MinGW). After building the libraries, I ran the tests. So, this seems to be a one-time cost only indeed.

PS: The JIT compiler cache is at ...
I have two methods in GB_jitifyer.c for compiling kernels: use direct calls to the compiler, or use a cmake script created just for that kernel. See the variable there called GB_jit_use_cmake. The user has control over that, except that on Windows I force this variable to always be true. It's just a matter of porting the GB_jitifyer_direct_compile function to Windows; I couldn't figure out how to do that. The Linux/Mac case is done with a single call:

SuiteSparse/GraphBLAS/Source/GB_jitifyer.c, lines 39 to 44 in 8627b9c
SuiteSparse/GraphBLAS/Source/GB_jitifyer.c, lines 1823 to 1833 in 8627b9c

The direct compile method only works on Linux/Mac. I couldn't get this to work on Windows:

SuiteSparse/GraphBLAS/Source/GB_jitifyer.c, lines 2289 to 2359 in 8627b9c
Thanks for these pointers. I opened #513 with changes that let me use the JIT compiler without CMake on Windows using MinGW compilers (from MSYS2). |
I believe this is solved now. (At least for Windows using MinGW.) Thank you again for the explanations and pointers. |
Describe the bug
When configuring GraphBLAS with -DCOMPACT=ON on Windows, the tests of LAGraph take considerably longer (approx. a factor of 5-10).

To Reproduce
Configure and build GraphBLAS with -DCOMPACT=ON. Then, run ctest for LAGraph.

Expected behavior
Maybe I'm misunderstanding what should have happened, but I expected the JIT compiler to kick in, which would result in only a slight increase in run duration.