FastDGEMM

A simple multithreaded version of the Hennessy-Patterson optimized dgemm routine in C++.
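For orientation, dgemm computes C = C + A * B on double-precision matrices. The unoptimized starting point could be sketched as below. This is only an illustration, not the code in my_dgemm.cpp: the function name `naive_dgemm`, the restriction to square matrices, and the column-major layout (element (i, j) stored at index i + j * n) are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference dgemm sketch: C += A * B for square n x n matrices.
// Assumption: column-major storage, element (i, j) is at index i + j * n.
void naive_dgemm(std::size_t n, const std::vector<double>& A,
                 const std::vector<double>& B, std::vector<double>& C) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double cij = C[i + j * n];               // running value of C(i, j)
            for (std::size_t k = 0; k < n; ++k) {
                cij += A[i + k * n] * B[k + j * n];  // A(i, k) * B(k, j)
            }
            C[i + j * n] = cij;
        }
    }
}
```

A routine like this is mainly useful as a correctness reference: every optimized variant should produce the same C for the same inputs.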

Short benchmark code for the OpenBLAS library is added for comparison.

I would be happy to know your opinion about this experiment and get some feedback about the general methodology, benchmark reliability, and quality of the code itself.

I would also appreciate any ideas on how to improve the performance of this code.

Stages of improvement ("Getting faster"):

Compiler optimizations

AVX x86 intrinsics

Loop unrolling (the -O3 compiler option should be enabled)

New feature: multithreading using std::thread (the implementation may not be perfect)

Cache blocking (controlling the age of the array accesses)

Comparison with OpenBLAS. We've got a really big trip ahead...
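Two of the stages above, cache blocking and std::thread multithreading, can be combined roughly as in the following sketch. It is written under stated assumptions and is not the repository's implementation: the names `kBlock`, `do_block` and `blocked_threaded_dgemm`, the tile size of 32, the column-major layout, and the round-robin split of column tiles across threads are all hypothetical choices for this example. It uses no AVX intrinsics and leaves vectorization of the inner loop to the compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical tile size; in practice it is tuned to the cache size.
constexpr std::size_t kBlock = 32;

// Adds the product of one tile of A and one tile of B into one tile of C.
// Assumption: column-major storage, element (i, j) is at index i + j * n.
static void do_block(std::size_t n, std::size_t si, std::size_t sj,
                     std::size_t sk, const double* A, const double* B,
                     double* C) {
    const std::size_t ei = std::min(si + kBlock, n);  // clamp edge tiles
    const std::size_t ej = std::min(sj + kBlock, n);
    const std::size_t ek = std::min(sk + kBlock, n);
    for (std::size_t j = sj; j < ej; ++j) {
        for (std::size_t k = sk; k < ek; ++k) {
            const double bkj = B[k + j * n];
            // Innermost loop walks contiguous memory in A and C.
            for (std::size_t i = si; i < ei; ++i) {
                C[i + j * n] += A[i + k * n] * bkj;
            }
        }
    }
}

// C += A * B. Each thread owns a disjoint set of column tiles of C,
// so the only synchronization needed is the final join().
void blocked_threaded_dgemm(std::size_t n, const std::vector<double>& A,
                            const std::vector<double>& B,
                            std::vector<double>& C, unsigned num_threads) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    assert(num_threads > 0);
    const std::size_t col_tiles = (n + kBlock - 1) / kBlock;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Round-robin: thread t handles column tiles t, t + T, t + 2T, ...
            for (std::size_t tj = t; tj < col_tiles; tj += num_threads) {
                for (std::size_t sk = 0; sk < n; sk += kBlock) {
                    for (std::size_t si = 0; si < n; si += kBlock) {
                        do_block(n, si, tj * kBlock, sk,
                                 A.data(), B.data(), C.data());
                    }
                }
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```

Splitting the work by columns of C means no two threads ever write to the same element, which avoids locks and atomics. A call such as `blocked_threaded_dgemm(n, A, B, C, std::thread::hardware_concurrency())` would use one thread per hardware thread, provided that value is non-zero on the target system.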

NikitaMatckevich/FastDGEMM
