Commit f7c1a6d
Add nvbench-based benchmarks for the fltflt data types (#1124)
* Add nvbench-based benchmarks for the fltflt data types
Add several benchmarks that are built via -DMATX_BUILD_BENCHMARKS=ON.
Also add bench/scripts/run_fltflt_benchmarks.py, which runs the fltflt
benchmarks and summarizes the results. Example results from a run on an
RTX PRO 6000 Blackwell Server Edition are as follows:
Performance relative to single-precision (float = 1.0x baseline)
Higher values indicate slower performance

Benchmark      float     double     fltflt     fltflt vs dbl
------------------------------------------------------------------
add            1.00x     71.10x     28.84x         2.47x
sub            1.00x     71.11x     28.85x         2.46x
mul            1.00x     71.17x     10.15x         7.01x
div            1.00x     52.63x      5.85x         8.99x
sqrt           1.00x     52.40x      3.89x        13.48x
abs            1.00x      2.17x      2.15x         1.01x
fma            1.00x     71.13x     25.36x         2.81x
madd           1.00x     71.14x     38.78x         1.83x
------------------------------------------------------------------
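The last column of the table above appears to be the double slowdown divided by the fltflt slowdown, i.e. how many times faster fltflt is than fp64. A minimal Python sketch of that arithmetic follows; the function name `speedup_vs_double` and the `rows` dictionary are hypothetical and are not taken from run_fltflt_benchmarks.py.

```python
def speedup_vs_double(double_slowdown: float, fltflt_slowdown: float) -> float:
    """Return how many times faster fltflt is than double.

    Both arguments are slowdowns relative to the float baseline (float = 1.0x).
    """
    return double_slowdown / fltflt_slowdown


# Rounded (double, fltflt) slowdowns copied from three rows of the table above.
rows = {"add": (71.10, 28.84), "mul": (71.17, 10.15), "abs": (2.17, 2.15)}

for name, (dbl, ff) in rows.items():
    print(f"{name:<5} {speedup_vs_double(dbl, ff):.2f}x")
```

Because the table's inputs are already rounded to two decimals, recomputing some rows this way (div and sqrt, for example) can differ from the printed column in the last digit.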
Note that addition and subtraction are only ~2.5x faster using fltflt than
fp64. Multiplication, division, and square root are significantly faster
(~7x, ~9x, and ~13.5x, respectively).
Future updates may improve addition performance, but potentially at an
accuracy cost, so the changes will likely be opt-in.
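To illustrate why float-float addition can be comparatively expensive, and what an accuracy-for-speed trade might look like, here is a sketch of two textbook double-word addition variants (as found in double-double libraries such as QD), emulated in Python with float32 rounding. This is an assumption for illustration only: it is not the MatX fltflt implementation, and the names `f32`, `two_sum`, `fast_two_sum`, `add_accurate`, and `add_sloppy` are hypothetical.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a: float, b: float):
    """Knuth's TwoSum (6 flops): s + e equals a + b exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def fast_two_sum(a: float, b: float):
    """Dekker's FastTwoSum (3 flops): requires |a| >= |b|."""
    s = f32(a + b)
    e = f32(b - f32(s - a))
    return s, e


def add_accurate(x, y):
    """Double-word add with error tracking on both words (about 20 flops)."""
    (xh, xl), (yh, yl) = x, y
    s, e = two_sum(xh, yh)
    t, f = two_sum(xl, yl)
    e = f32(e + t)
    s, e = fast_two_sum(s, e)
    e = f32(e + f)
    return fast_two_sum(s, e)


def add_sloppy(x, y):
    """Cheaper variant (about 11 flops) with a weaker error bound."""
    (xh, xl), (yh, yl) = x, y
    s, e = two_sum(xh, yh)
    e = f32(e + f32(xl + yl))
    return fast_two_sum(s, e)


# 1 + 2**-30 is not representable in float32, but the (hi, lo) pair keeps it.
print(add_accurate((1.0, 0.0), (2.0 ** -30, 0.0)))
print(add_sloppy((1.0, 0.0), (2.0 ** -30, 0.0)))
```

Under these assumptions, a single double-word addition costs roughly 11 to 20 float operations. The cheaper variant saves work by giving up part of the error bound, which is the kind of trade-off an opt-in fast path would expose.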
Signed-off-by: Thomas Benson <tbenson@nvidia.com>
* Add guards to run_fltflt_benchmarks.py parsing
Signed-off-by: Thomas Benson <tbenson@nvidia.com>
---------
Signed-off-by: Thomas Benson <tbenson@nvidia.com>
1 parent b221325 commit f7c1a6d
4 files changed (+883, -0)
- bench
  - 00_misc
  - scripts
- include/matx/kernels