A simple multithreaded version of the Hennessy-Patterson optimized dgemm routine in C++.
A short benchmark against the OpenBLAS library is included for comparison.
I would be happy to know your opinion about this experiment and get some feedback about the general methodology, benchmark reliability, and quality of the code itself.
I would also appreciate any ideas on how to improve the performance of this code.
Stages of improvement ("Getting faster"):
Compiler optimizations
AVX x86 intrinsics
Loop unrolling (the -O3 compiler option should be enabled)
New feature: multithreading using std::thread (the implementation may not be perfect)
Cache blocking (controlling the age of the array accesses)
Comparison with OpenBLAS. We've got a really big trip ahead...
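
To make the multithreading and cache blocking stages more concrete, here is a minimal sketch of how they can be combined. It is not the code from this repository: the names dgemm_naive, do_block, worker, dgemm_blocked_mt and the block size kBlock = 32 are assumptions made for illustration, matrices are assumed to be square and stored in column-major order, and the AVX intrinsics and loop unrolling stages are left out so that the example stays portable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed block size (hypothetical value); tune it so that the three
// blocks touched by do_block fit in the cache of the target CPU.
constexpr std::size_t kBlock = 32;

// Unoptimized baseline: C += A * B for n x n column-major matrices.
void dgemm_naive(std::size_t n, const double* A, const double* B, double* C) {
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) {
            double cij = C[i + j * n];
            for (std::size_t k = 0; k < n; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

// Update one block of C starting at (si, sj) using the block of A at
// (si, sk) and the block of B at (sk, sj). std::min clips the block at
// the matrix edge, so n does not have to be a multiple of kBlock.
void do_block(std::size_t n, std::size_t si, std::size_t sj, std::size_t sk,
              const double* A, const double* B, double* C) {
    const std::size_t ie = std::min(si + kBlock, n);
    const std::size_t je = std::min(sj + kBlock, n);
    const std::size_t ke = std::min(sk + kBlock, n);
    for (std::size_t j = sj; j < je; ++j)
        for (std::size_t i = si; i < ie; ++i) {
            double cij = C[i + j * n];
            for (std::size_t k = sk; k < ke; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

// Thread t processes column blocks t, t + nthreads, t + 2 * nthreads, ...
// Each thread writes to its own columns of C, so no locking is needed.
void worker(std::size_t n, std::size_t t, std::size_t nthreads,
            const double* A, const double* B, double* C) {
    for (std::size_t sj = t * kBlock; sj < n; sj += nthreads * kBlock)
        for (std::size_t si = 0; si < n; si += kBlock)
            for (std::size_t sk = 0; sk < n; sk += kBlock)
                do_block(n, si, sj, sk, A, B, C);
}

// Cache-blocked C += A * B, with the column blocks split across threads.
void dgemm_blocked_mt(std::size_t n, const double* A, const double* B,
                      double* C, std::size_t nthreads) {
    assert(n == 0 || (A && B && C));
    nthreads = std::max<std::size_t>(1, nthreads);
    std::vector<std::thread> pool;
    pool.reserve(nthreads);
    for (std::size_t t = 0; t < nthreads; ++t)
        pool.emplace_back(worker, n, t, nthreads, A, B, C);
    for (auto& th : pool) th.join();
}
```

A call such as dgemm_blocked_mt(n, A.data(), B.data(), C.data(), std::thread::hardware_concurrency()) would use one thread per hardware thread reported by the system. Splitting the work by column blocks is only one possible choice: it avoids synchronization, but the load becomes uneven when the number of column blocks is not a multiple of the thread count.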