Benchmarking sisalc, zig cc, and gcc to compile Sisal programmes.
Sisal is a long-forgotten language from the mid-80s, designed with supercomputers in mind. It is a functional language -- more specifically a dataflow language -- that compiles down to a dataflow graph (.if1) before eventually compiling to C.
Language-wise, it reads well and handles things like parallelisation very well. Unfortunately, the project died in the 90s/early 00s, so there's virtually no modern-day discussion around the language. This is an attempt to kick some life into the project -- because I think it's pretty cool.
I use Sisal 14.1.0. See my other repo for an installation guide (Linux) and some small programmes. More to follow. Alternatively, pick it up directly from Source Forge and hack away: https://sourceforge.net/projects/sisal/.
Big thanks to Pat Miller for keeping this alive and putting sisalc up on Source Forge :)
My C isn't too great, but I'm picking things up as I go and learning about compilers along the way. Ultimately, I want to write a new compiler for the IF1 dataflow graphs and get it into MLIR/XLA, bypassing C. From what I've read, IF1 is similar to XLA's HLO graph. This could (theoretically) let us write Sisal that targets GPUs and TPUs (!).
Using sisalc, one can compile foo.sis and output an executable ./foo by running
```shell
sisalc -o foo foo.sis  # May need to add -l<C lib> if calling C libs e.g. C's math library is -lm
```

Additional compiler flags let us turn off bounds-checking and apply some optimisations:
```shell
sisalc -no-bounds -aggvector -vector -p=<No. Threads> -o foo foo.sis
```

To run it, we pipe in any input we want (space separated):
```shell
echo "<my args>" | ./foo
```

Or, using foo.in rather than echo:
```shell
./foo < foo.in > foo.out 2>&1
```

Optionally, we can use the flag -wX to set the number of threads:
```shell
echo "<my args>" | ./foo -w12  # Running on 12 threads
```

We can take it one further and put taskset in front of the executable call to get a little more performance out of it:
```shell
echo "<my args>" | taskset -c 0-11 ./foo -w12
```

The binary comes with more flags in addition to -wX. For instance, -z can be used to suppress output (as is used in the benchmarking below), and flags like -gss, -cached and -strided apply different scheduling algorithms to improve performance.
```shell
echo "<my args>" | taskset -c 0-11 ./foo -w12 -z -gss
```

A full set can be viewed by calling -usage.
Now, to extract the intermediate foo.c file, we can run:
```shell
sisalc -C -forC foo.sis
```

Interested in more intermediate files? -keep can be used to compile the programme and keep all the files:
```shell
sisalc -keep -o foo foo.sis
```

The above works in combination with optimisation flags. The resultant C code can be checked to see if e.g. -no-bounds did its job.
This works pretty well, but what if we didn't want to use GCC?
We can use Zig as a drop-in C compiler for foo.c as follows:
```shell
sisalc -keep -o foo foo.sis  # For some reason -forC produces errors when calling zig cc
zig cc -o foo /usr/local/lib/sisal/srt0.o foo.c \
-I/usr/local/include/sisal \
-L/usr/local/lib/sisal \
-Wl,--whole-archive -lsisal -Wl,--no-whole-archive
# Clean-up
rm foo.mem foo.mono foo.o foo.opt foo.part foo.up
```

You might find the Zig-compiled binary is actually slower than sisalc's, since we don't use any optimisers. We can pass flags to zig cc as we would for any old foo.c. Below are some I found to really crank up the performance, just to see what's possible:
```shell
zig cc -O3 -march=native -ffast-math -flto=thin -fno-math-errno \
-mllvm -enable-loopinterchange -mllvm -enable-loop-distribute \
-funroll-loops -mprefer-vector-width=256 -mllvm -force-vector-interleave=4 \
-mllvm -prefetch-distance=8 -mllvm --loop-prefetch-writes \
-o foo /usr/local/lib/sisal/srt0.o foo.c \
-I/usr/local/include/sisal \
-L/usr/local/lib/sisal \
-Wl,--whole-archive -lsisal
```

The above isn't necessary in all cases; for example, -ffast-math could lead to imprecision if floating-point errors accumulate. Below is a summary of each flag (Claude helped here):
- `-O3`: Maximum compiler optimisation; enables auto-vectorisation, aggressive inlining, and loop transformations
- `-march=native`: Emit instructions for the exact CPU running the build, enabling AVX2/AVX-512 and other extensions not available in generic x86-64 code
- `-ffast-math`: Relaxes strict IEEE 754 floating-point rules, allowing reordering and approximation of float operations for speed
- `-flto=thin`: Enables ThinLTO (Link Time Optimisation); allows the compiler to optimise across translation unit boundaries at link time, with lower memory cost than full LTO
- `-fno-math-errno`: Skips setting `errno` after math functions like `sqrt`; removes a hidden branch and memory write after every such call
- `-mllvm -enable-loopinterchange`: Allows LLVM to swap the order of nested loops to improve cache locality (e.g. making the inner loop stride sequentially through memory)
- `-mllvm -enable-loop-distribute`: Splits loops with multiple independent bodies into separate loops, enabling each part to be vectorised independently
- `-funroll-loops`: Unrolls loop bodies to reduce branch overhead and expose more instruction-level parallelism to the CPU
- `-mprefer-vector-width=256`: Instructs LLVM to prefer 256-bit (AVX2) vector operations over 128-bit or 512-bit, balancing throughput against frequency throttling on most desktop CPUs
- `-mllvm -force-vector-interleave=4`: Interleaves 4 vector iterations in the loop body, hiding memory latency by keeping the CPU's out-of-order execution units busy
- `-mllvm -prefetch-distance=8`: Inserts software prefetch instructions 8 iterations ahead of the current loop position, reducing cache miss stalls
- `-mllvm --loop-prefetch-writes`: Extends prefetching to write targets as well as reads, useful when the loop both reads and writes large arrays
Included in this repo is zig-compile-sisal.sh, which saves typing these commands out by hand.
The shell script takes foo as an input and returns ./foo:
```shell
bash zig-compile-sisal.sh foo
```

This uses the LLVM flags. They can be removed if not needed.
So, the gcc invocation looks as follows and is very similar to zig cc -- an important difference being the linker flags after -lsisal:
```shell
gcc -o dcf /usr/local/lib/sisal/srt0.o dcf.c \
-I/usr/local/include/sisal \
-L/usr/local/lib/sisal \
-Wl,--whole-archive -lsisal -Wl,--no-whole-archive \
-lm
```

Optimisation flags also behave similarly -- minus the LLVM flags:
```shell
gcc -O3 -march=native -ffast-math -flto=auto -fno-math-errno \
-floop-interchange -floop-strip-mine \
-funroll-loops -mprefer-vector-width=256 \
-fprefetch-loop-arrays \
-o dcf /usr/local/lib/sisal/srt0.o dcf.c \
-I/usr/local/include/sisal \
-L/usr/local/lib/sisal \
-Wl,--whole-archive -lsisal -Wl,--no-whole-archive \
-lm
```

As the benchmark below shows, zig cc seems to outperform gcc when we turn the optimisers on (!?), likely owing to LLVM doing some better optimisation under the hood.
We have a Sisal programme dcf.sis that performs a fairly large matrix-vector multiplication using for-loops. Note that this is a toy problem to show the differences in run-time under different compilations. Results may change depending on the problem, but I have found a 'nerd-sniped' zig cc call to produce the fastest executables for Sisal programmes. YMMV.
Sisal's optimiser aggressively 'constant-folds' expressions whose values can be determined at compile time. A matrix populated with literal values (e.g. all 1s multiplied by all 2s) will have the entire computation folded away at compile time, producing unrealistically fast benchmark times. Matrix elements must be generated from expressions the compiler cannot statically resolve -- for example, using index-derived arithmetic like `i * j + 1` -- to ensure the computation genuinely occurs at runtime and benchmark results are meaningful.
For instance, example/dcf_constant.sis runs in ~4ms single-threaded using plain sisalc -- faster than any of the benchmarks below -- because the whole computation is folded away. To keep things fair, the index-derived dependency is introduced. Constant folding is very helpful when you need it, just not for these benchmarks.
- CPU: 11th Gen Intel i5-11400H (12) @ 4.500GHz
- Memory: 16GB RAM
- OS: Pop!_OS 24.04 LTS x86_64
- Kernel: 6.18.7-76061807-generic
- GCC: 13.3.0
- Zig: 0.12.0-dev.2644+42fcca49c
- Sisal: 14.1.0

| # | Compiler | Compile Flags | Execution | Wall Time | CPU% |
|---|----------|---------------|-----------|-----------|------|
| 1 | sisalc | (default) | single-threaded | 1.686s | 99% |
| 2 | sisalc | (default) | `-w12` + `taskset` | 0.324s | 717% |
| 3 | sisalc | `-no-bounds -aggvector -vector -p=12` | `-w12` + `taskset` | 0.181s | 628% |
| 4 | zig cc | (none) | single-threaded | 24.602s | 99% |
| 5 | zig cc | (none) | `-w12` + `taskset` | 4.628s | 1108% |
| 6 | zig cc | `-O3` | single-threaded | 1.244s | 99% |
| 7 | zig cc | `-O3` | `-w12` + `taskset` | 0.163s | 584% |
| 8 | zig cc | `-O3 -march=native -ffast-math -flto=thin` + LLVM flags | single-threaded | 0.493s | 99% |
| 9 | zig cc | `-O3 -march=native -ffast-math -flto=thin` + LLVM flags | `-w12` + `taskset` | 0.063s | 266% |
| 10 | gcc | (none) | single-threaded | 2.267s | 99% |
| 11 | gcc | (none) | `-w12` + `taskset` | 0.480s | 816% |
| 12 | gcc | `-O3` | single-threaded | 1.383s | 99% |
| 13 | gcc | `-O3` | `-w12` + `taskset` | 0.298s | 663% |
| 14 | gcc | `-O3 -march=native -ffast-math -flto=auto` + loop flags | single-threaded | 1.433s | 99% |
| 15 | gcc | `-O3 -march=native -ffast-math -flto=auto` + loop flags | `-w12` + `taskset` | 0.187s | 598% |
Overall, optimised sisalc compilation with multi-threading (a freebie by design) can get you pretty far; compared to plain zig cc and gcc, it seems to apply some optimisations of its own. Zig with -O3 takes it up a notch, and the full set of flags makes a real difference in run-time: about 2.9x faster (0.063s vs 0.181s) than the best I could get with sisalc-only optimisation. gcc had a better 'unoptimised' run than zig cc, but worse than sisalc.
These numbers also depend on environmental factors, so they really just give a general idea. Overall, ~63ms is pretty decent for fairly large matrix-vector multiplication using for-loops!