This repository contains code that attempts to replicate the performance improvements of the dense matrix-matrix multiplication program discussed in the first lecture of the MIT course 6.172 Performance Engineering of Software Systems. The video recording of Prof. Charles Leiserson's lecture is here; the lecture is from Fall 2018, so a few years have passed since then.
Three languages are used for the implementation:
- Python
- Java
- C99
I started from the source code shown in the lecture slides, which can be found here. However, I made substantial modifications, especially to the C code, so I do not claim that my programs are equivalent to those used in the MIT course.
The C program includes several implementations of the matrix-matrix multiplication algorithm, with various levels of optimization.
The starting point is the ijk algorithm, i.e., the classic "textbook" matmul; it corresponds to the following C code:
```c
/* Compute r = p * q */
void matmul(const float * restrict p, const float * restrict q, float * restrict r, int n)
{
    for (int i=0; i<n; i++)
        for (int j=0; j<n; j++)
            r[i*n + j] = 0.0f;
    for (int i=0; i<n; i++)
        for (int j=0; j<n; j++)
            for (int k=0; k<n; k++)
                r[i*n + j] += p[i*n + k] * q[k*n + j];
}
```

The program is split into several source files that are compiled separately with different flags.
| Source | Description | Compiler flags |
|---|---|---|
| `matmul_ijk.c` | Unoptimized reference algorithm | (none) |
| `matmul__opt.c` | Compiler optimization, all permutations of the nested loops | `-O2` |
| `matmul_ikj_simd_auto.c` | ikj nesting + compiler vectorization | `-O2 -ffast-math -ftree-vectorize -march=native` |
| `matmul_ikj_simd.c` | ikj nesting + manual vectorization using AVX2 SIMD extensions | `-O2 -ffast-math -mavx2 -mfma` |
| `matmul_ikj_omp.c` | ikj nesting + parallel loops using OpenMP | `-O2 -fopenmp` |
| `matmul_ikj_simd_omp.c` | ikj nesting + manual vectorization + parallel loops | `-O2 -fopenmp -mavx2 -mfma -ffast-math` |
| `matmul_dac.c` | Divide-and-conquer + compiler vectorization + OpenMP | `-O2 -fopenmp -ftree-vectorize -march=native -ffast-math` |
To compile, type:

```
make
```

The provided Makefile compiles the C and Java programs with the appropriate compiler flags. The C program requires gcc with OpenMP support.
Run the Python program as:

```
./matmul-test.py [n]
```

where the optional parameter n is the matrix size (default 4096).

Run the Java program as:

```
java matmul_test [n]
```

Finally, run the C program as:

```
./matmul-test [n]
```
Note

Modern processors have separate P(erformance) and E(fficient) cores that operate at different clock rates. For example, on one of my systems (12th Gen Intel(R) Core(TM) i9-12900F processor) the OS reports 24 logical cores, of which 16 map to the P-cores, with adjacent pairs sharing the same (hyperthreaded) physical core. On this machine, to use the P-cores only, set the OMP_PLACES and OMP_NUM_THREADS environment variables as follows:

```
OMP_NUM_THREADS=16 OMP_PLACES="0:15" ./matmul-test
```
Performance results on several machines are listed in the file `matmul_results.ods`. I report one of the data tables below.
| Feature | Value |
|---|---|
| Processor Type | Intel(R) Core(TM) i7-9800X |
| Clock frequency | 3.8 GHz |
| Processor chips | 1 |
| Processing cores | 8 |
| Threads per core | 2 |
| Floating-point unit | 742 GFLOPS¹ |
| L1-icache | 32 KB x 8 |
| L1-dcache | 32 KB x 8 |
| L2-cache | 1 MB x 8 |
| L3-cache | 16.5 MB |
| DRAM | 32 GB |
| OS | Ubuntu 24.04 Linux 6.14.0-29-generic |
| Python | 3.12.3 |
| Java | openjdk 21.0.8 2025-07-15 |
| gcc | 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04) |
| Matrix size | 4096 |
| Version | Time (s) | GFLOPS | % of peak | Speedup |
|---|---|---|---|---|
| Python | 13422.66 | 0.01 | 0.00% | 1.00 |
| Java | 463.77 | 0.30 | 0.04% | 28.94 |
| Serial ijk | 1582.71 | 0.09 | 0.01% | 8.48 |
| Serial ijk + O2 | 559.39 | 0.25 | 0.03% | 24.00 |
| Serial ikj + O2 | 43.73 | 3.14 | 0.42% | 306.94 |
| Serial jik + O2 | 366.82 | 0.37 | 0.05% | 36.59 |
| Serial jki + O2 | 1108.43 | 0.12 | 0.02% | 12.11 |
| Serial kij + O2 | 44.57 | 3.08 | 0.42% | 301.17 |
| Serial kji + O2 | 1157.95 | 0.12 | 0.02% | 11.59 |
| ikj + auto SIMD | 21.97 | 6.25 | 0.84% | 610.87 |
| ikj + manual SIMD | 23.14 | 5.94 | 0.80% | 580.14 |
| ikj + manual SIMD + OpenMP | 1.84 | 74.82 | 10.08% | 7306.84 |
| Parallel divide-and-conquer | 1.06 | 129.66 | 17.47% | 12662.88 |
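The fastest version is the parallel divide-and-conquer implementation. As a rough, hedged sketch of what such a recursive, task-parallel decomposition can look like (my illustration under assumed conventions such as a power-of-two matrix size and an arbitrary BASE cutoff, not the actual contents of `matmul_dac.c`):

```c
/* Divide-and-conquer matmul: a sketch of the general technique, not the
 * actual contents of matmul_dac.c.  Computes r += p*q on n x n blocks of
 * row-major matrices with row stride ld; BASE is an assumed tuning
 * parameter and n is assumed to be a power of two.  Compile with -fopenmp. */
#define BASE 64

static void dac(const float *p, const float *q, float *r, int n, int ld)
{
    if (n <= BASE) {
        /* Base case: plain ikj loops that the compiler can auto-vectorize. */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    r[i*ld + j] += p[i*ld + k] * q[k*ld + j];
        return;
    }

    int h = n / 2;
    /* Offsets of the four quadrants of an n x n block. */
    int o11 = 0, o12 = h, o21 = h*ld, o22 = h*ld + h;

    /* First group: four quadrant products writing distinct quadrants of r,
     * so they can run as independent OpenMP tasks. */
    #pragma omp task
    dac(p + o11, q + o11, r + o11, h, ld);
    #pragma omp task
    dac(p + o11, q + o12, r + o12, h, ld);
    #pragma omp task
    dac(p + o21, q + o11, r + o21, h, ld);
    #pragma omp task
    dac(p + o21, q + o12, r + o22, h, ld);
    #pragma omp taskwait

    /* Second group: accumulates into the same quadrants of r, so it must
     * wait for the first group to finish. */
    #pragma omp task
    dac(p + o12, q + o21, r + o11, h, ld);
    #pragma omp task
    dac(p + o12, q + o22, r + o12, h, ld);
    #pragma omp task
    dac(p + o22, q + o21, r + o21, h, ld);
    #pragma omp task
    dac(p + o22, q + o22, r + o22, h, ld);
    #pragma omp taskwait
}

void matmul_dac(const float *p, const float *q, float *r, int n)
{
    for (int i = 0; i < n*n; i++)
        r[i] = 0.0f;
    #pragma omp parallel
    #pragma omp single
    dac(p, q, r, n, n);
}
```

The recursion improves locality because every base-case block eventually fits in cache, while the simple base-case loops remain vectorizable by the compiler.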
Each iteration of the innermost loop performs two floating-point
operations (one addition and one multiplication). The total number of
floating-point operations is therefore $2n^3$.
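For the default matrix size this works out as follows, using the divide-and-conquer row of the table as a check:

$$
2n^3 = 2 \cdot 4096^3 \approx 1.37 \times 10^{11} \ \text{flop},
\qquad
\frac{1.37 \times 10^{11}\ \text{flop}}{1.06\ \text{s}} \approx 130\ \text{GFLOPS}.
$$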
- The most striking result is that Java is about 3.4 times faster than the unoptimized C code. This is in stark contrast with the results from the 2018 MIT lecture, where Java was about two times slower. Java is currently still faster than the optimized code produced by gcc with the `-O2` flag. This happens on all processors I ran the programs on, which, however, have an identical software stack.
- Different permutations of the loop nest produce large variations in the execution time, from about 44 s (ikj) to about 1158 s (kji), all compiled with `-O2`. This is consistent with the MIT lecture, and is due to memory access patterns that are more or less cache-friendly.
- Compiler auto-vectorization produces faster code than manual vectorization using AVX2 extensions. The manually written (by me) SIMD code is therefore sub-optimal.
- The single-core SIMD versions are only about twice as fast as the unvectorized code with the optimal loop nest. This is a bit disappointing, since AVX2 instructions can process 8 floats at a time, so one might expect a speedup closer to $8\times$; see the sketch after this list for the kind of inner loop involved.
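For reference, here is a minimal sketch of what a manually vectorized ikj inner loop can look like with AVX2/FMA intrinsics. It is an illustration of the technique only, assuming n is a multiple of 8, and is not the actual code in `matmul_ikj_simd.c`.

```c
#include <immintrin.h>

/* Manually vectorized ikj inner loop using AVX2 + FMA intrinsics.
 * A sketch of the technique only, assuming row-major layout and
 * n a multiple of 8; compile with -mavx2 -mfma. */
void matmul_ikj_avx2(const float * restrict p, const float * restrict q,
                     float * restrict r, int n)
{
    for (int i = 0; i < n*n; i++)
        r[i] = 0.0f;

    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            /* Broadcast p[i][k] into all 8 lanes of a vector register. */
            __m256 pik = _mm256_set1_ps(p[i*n + k]);
            for (int j = 0; j < n; j += 8) {
                __m256 qkj = _mm256_loadu_ps(&q[k*n + j]);
                __m256 rij = _mm256_loadu_ps(&r[i*n + j]);
                /* r[i][j..j+7] += p[i][k] * q[k][j..j+7] in one FMA. */
                rij = _mm256_fmadd_ps(pik, qkj, rij);
                _mm256_storeu_ps(&r[i*n + j], rij);
            }
        }
}
```

One possible reason the auto-vectorized code ends up faster is that the compiler can unroll and schedule the loops more aggressively than this straightforward translation, which reloads and stores a row chunk of r on every k iteration.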