This repository contains code that tries to replicate the performance improvements of dense matrix-matrix multiplication discussed in the first lecture of the MIT course 6.172 Performance Engineering of Software Systems. The video recording of Prof. Charles Leiserson discussing this topic is here. Note that the lecture is from 2018, so a few years have passed since then.
Three languages are used for the implementation:
- Python
- Java
- C99
I started from the source code shown in the slides here. However, substantial modifications were made, especially to the C code, so I do not claim that my program is in any way equivalent to that used in the MIT course. As usual, I take the blame for any errors.
The C program includes several implementations of the matrix-matrix multiplication algorithm, with various levels of optimization.
The starting point is the ijk algorithm, i.e., the classic "textbook" implementation corresponding to the following C code:
```c
/* Compute r = p * q */
void matmul(const float * restrict p, const float * restrict q,
            float * restrict r, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            r[i*n + j] = 0.0f;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                r[i*n + j] += p[i*n + k] * q[k*n + j];
}
```
On top of that, several optimizations are applied.
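Several of the variants below interchange the two innermost loops (the ikj order). As a hedged sketch of what that loop interchange looks like (my illustration, not necessarily the exact code in the `matmul_ikj_*.c` files):

```c
/* Sketch of the ikj loop order. The innermost loop now scans q and r
 * row-wise (stride 1) instead of scanning q column-wise (stride n),
 * which is far more cache-friendly. Illustrative only; the code in the
 * matmul_ikj_*.c files may differ. */
void matmul_ikj(const float * restrict p, const float * restrict q,
                float * restrict r, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            r[i*n + j] = 0.0f;

    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            const float pik = p[i*n + k];   /* invariant in the j loop */
            for (int j = 0; j < n; j++)
                r[i*n + j] += pik * q[k*n + j];
        }
}
```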
The source code is split into several source files that are compiled separately, possibly with different compiler flags as shown in the table below.
| Source | Description | Compiler flags |
|---|---|---|
| `matmul_ijk.c` | Unoptimized reference algorithm | (none) |
| `matmul_opt.c` | Compiler optimization | `-O2` |
| `matmul_ikj_simd_auto.c` | ikj nesting + compiler vectorization | `-O2 -ffast-math -ftree-vectorize -march=native` |
| `matmul_ikj_simd.c` | ikj nesting + manual vectorization using SIMD extensions | `-O2 -ffast-math -march=native` |
| `matmul_ikj_omp.c` | ikj nesting + parallel loops using OpenMP | `-O2 -fopenmp` |
| `matmul_ikj_simd_omp.c` | ikj nesting + manual vectorization + parallel loops | `-O2 -fopenmp -march=native -ffast-math` |
| `matmul_dac.c` | Divide-and-conquer + compiler vectorization + OpenMP tasking | `-O2 -fopenmp -ftree-vectorize -march=native -ffast-math` |
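To give an idea of what "manual vectorization using SIMD extensions" involves, here is a hedged sketch of the ikj order using AVX/FMA intrinsics. It assumes `n` is a multiple of 8 and is not necessarily the code in `matmul_ikj_simd.c`:

```c
#include <immintrin.h>

/* Hypothetical sketch of manual AVX vectorization of the ikj order:
 * each 256-bit __m256 register holds 8 floats, and _mm256_fmadd_ps
 * performs 8 multiply-adds at once (requires FMA, available with
 * -march=native on this CPU). Assumes n is a multiple of 8; the real
 * matmul_ikj_simd.c may differ. */
void matmul_ikj_simd_sketch(const float * restrict p,
                            const float * restrict q,
                            float * restrict r, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j += 8)
            _mm256_storeu_ps(&r[i*n + j], _mm256_setzero_ps());

    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            const __m256 pik = _mm256_set1_ps(p[i*n + k]); /* broadcast */
            for (int j = 0; j < n; j += 8) {
                __m256 acc = _mm256_loadu_ps(&r[i*n + j]);
                acc = _mm256_fmadd_ps(pik, _mm256_loadu_ps(&q[k*n + j]), acc);
                _mm256_storeu_ps(&r[i*n + j], acc);
            }
        }
}
```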
To compile, type:

```
make
```
The provided Makefile compiles the C and Java programs using the appropriate compiler flags. The C program requires gcc and relies on OpenMP support. The program also compiles with LLVM/clang using the following command:

```
CC=clang make
```
Run the Python program as:

```
./matmul-test.py [n]
```

where the optional parameter `n` is the matrix size (default 4096; WARNING: the Python program may take several hours with the default matrix size!).
Run the Java program as:

```
java matmul_test [n]
```

Finally, run the C program as:

```
./matmul-test [n]
```
By default, the C program executes all algorithms; it is possible to use the command-line flag `-a [keys]` to restrict execution to a subset of the algorithms. `[keys]` is a sequence of characters that represent the algorithms to run, according to the following table:
| Key | Algorithm |
|---|---|
| `a` | Serial ijk |
| `b` | Serial ijk + O2 |
| `c` | Serial ikj + O2 |
| `d` | Serial jik + O2 |
| `e` | Serial jki + O2 |
| `f` | Serial kij + O2 |
| `g` | Serial kji + O2 |
| `h` | ikj + O2 + auto SIMD |
| `i` | ikj + O2 + manual SIMD |
| `j` | ikj + O2 + manual SIMD + OpenMP |
| `k` | Parallel divide-and-conquer |
Therefore, to execute the unoptimized ijk algorithm (key `a`) and the parallel divide-and-conquer (key `k`) on matrices of size 2048, type:

```
./matmul-test -a ak 2048
```
**Note**

Modern processors have separate P(erformance) and E(fficiency) cores that operate at different clock rates. For example, on one of my systems (12th Gen Intel(R) Core(TM) i9-12900F processor) the OS reports 24 logical cores, of which the first 16 are P-cores where adjacent pairs are mapped to the same (hyperthreaded) physical core. To use the P-cores only, I set the OMP_PLACES and OMP_NUM_THREADS environment variables as follows:

```
OMP_NUM_THREADS=16 OMP_PLACES="0:16" ./matmul-test
```
**Note**

These results are not intended to provide a reliable benchmark of the processor or of the compilers/interpreters used.

The following table reports the hardware/software configuration of the machine used for the tests; the program has been executed on bare hardware.
| Feature | Value |
|---|---|
| Processor Type | Intel(R) Core(TM) i9-12900F |
| Clock frequency | 5.1 GHz (P-core), 3.8 GHz (E-core) |
| Processor chips | 1 |
| Processing cores | 8 P-cores + 8 E-cores |
| Threads per core | 2 per P-core, 1 per E-core |
| Floating-point unit | 614.4 GFLOPS (theoretical peak) |
| SIMD extensions | MMX, SSE, SSE2, SSE3, SSE4_1, SSE4_2, AVX2 |
| L1-dcache | 640 KB total (16 instances) |
| L1-icache | 768 KB total (16 instances) |
| L2-cache | 14 MB total (10 instances) |
| L3-cache | 30 MB total (one instance) |
| DRAM | 64 GB |
| OS | Ubuntu 24.04 Linux 6.14.0-37-generic |
| Python | 3.12.3 |
| Java | openjdk 21.0.10 2026-01-20 |
| gcc | 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04) |
| Matrix size | 4096 |
In the table below:

- Time is the execution time, in seconds, averaged over five independent executions;
- GFLOPS is an estimate of the number of floating-point operations per second, in billions ($10^9$; see below);
- % of peak is the percentage of peak performance achieved by the algorithm; the peak performance of this machine is 614.4 GFLOPS;
- Speedup is the relative speedup with respect to Python.
| Algorithm | Time (s) | GFLOPS | % of peak | Speedup |
|---|---|---|---|---|
| Python | 6711.862 | 0.02 | 0.00% | 1.00 |
| Java | 322.02 | 0.43 | 0.07% | 20.84 |
| Serial ijk | 489.84 | 0.28 | 0.05% | 13.70 |
| Serial ijk + O2 | 322.20 | 0.41 | 0.07% | 20.20 |
| Serial ikj + O2 | 22.19 | 6.19 | 1.01% | 302.49 |
| Serial jik + O2 | 182.30 | 0.75 | 0.12% | 36.82 |
| Serial jki + O2 | 577.15 | 0.24 | 0.04% | 11.43 |
| Serial kij + O2 | 21.32 | 6.45 | 1.05% | 314.79 |
| Serial kji + O2 | 546.62 | 0.25 | 0.04% | 12.28 |
| ikj + auto SIMD | 10.16 | 13.52 | 2.20% | 660.32 |
| ikj + manual SIMD | 12.50 | 11.00 | 1.79% | 537.10 |
| ikj + manual SIMD + OpenMP | 1.24 | 110.70 | 18.03% | 5405.80 |
| Parallel divide-and-conquer | 0.43 | 316.68 | 51.58% | 15465.08 |
Each iteration of the innermost loop performs two floating-point operations (one addition and one multiplication). Therefore, the total number of floating-point operations is $2n^3$; for the default $n = 4096$, that is $2 \times 4096^3 \approx 1.37 \times 10^{11}$. The GFLOPS column is this count divided by the execution time and by $10^9$.
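For example, plugging in the Python row from the table above gives $2 \times 4096^3 / (6711.862\ \mathrm{s}) \approx 2.05 \times 10^7$ FLOP/s, i.e. the 0.02 GFLOPS reported.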
- The most striking result is that Java is 1.5 times faster than unoptimized C code. This is in stark contrast with the results shown in the MIT lecture from 2018, where Java was two times slower. Java is still faster, by a small margin, than the optimized code produced by gcc with the `-O2` flag.
- Different permutations of the loop nest show wide variations in the execution time. This is consistent with the MIT lecture, and is due to memory access patterns that are more or less cache-friendly.
- Compiler auto-vectorization produces surprisingly efficient code that is faster than hand-coded SIMD.
- The single-core manual SIMD version is only about twice as fast as the unvectorized code. This is disappointing, since we are using AVX2 instructions with 256-bit registers that can hold 8 floats, so one would expect a speedup close to $8\times$.
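For readers curious about the best-performing variant, here is a hedged sketch of how divide-and-conquer matrix multiplication can be parallelized with OpenMP tasks. The cutoff `BASE`, the power-of-two restriction on `n`, and the requirement that the caller zero `r` are assumptions of this sketch; `matmul_dac.c` may be organized differently:

```c
#include <omp.h>

#define BASE 64   /* assumed recursion cutoff */

/* Recursively multiply n-by-n submatrices of matrices with leading
 * dimension ld, accumulating into r. Each of the four target quadrants
 * of r is independent of the others, so they are computed as separate
 * OpenMP tasks; the two products that accumulate into the same quadrant
 * stay sequential within a task to avoid a data race. */
static void dac(const float *p, const float *q, float *r, int n, int ld)
{
    if (n <= BASE) {                     /* base case: plain ikj loops */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    r[i*ld + j] += p[i*ld + k] * q[k*ld + j];
        return;
    }
    int h = n / 2;                       /* split each matrix in four */
    const float *p00 = p,          *p01 = p + h,
                *p10 = p + h*ld,   *p11 = p + h*ld + h;
    const float *q00 = q,          *q01 = q + h,
                *q10 = q + h*ld,   *q11 = q + h*ld + h;
    float       *r00 = r,          *r01 = r + h,
                *r10 = r + h*ld,   *r11 = r + h*ld + h;

    #pragma omp task
    { dac(p00, q00, r00, h, ld); dac(p01, q10, r00, h, ld); }
    #pragma omp task
    { dac(p00, q01, r01, h, ld); dac(p01, q11, r01, h, ld); }
    #pragma omp task
    { dac(p10, q00, r10, h, ld); dac(p11, q10, r10, h, ld); }
    #pragma omp task
    { dac(p10, q01, r11, h, ld); dac(p11, q11, r11, h, ld); }
    #pragma omp taskwait
}

/* Entry point: r must be zeroed by the caller; n assumed a power of two. */
void matmul_dac_sketch(const float *p, const float *q, float *r, int n)
{
    #pragma omp parallel
    #pragma omp single
    dac(p, q, r, n, n);
}
```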