Commit f7c1a6d

Add nvbench-based benchmarks for the fltflt data types (#1124)

* Add nvbench-based benchmarks for the fltflt data types

  Add several benchmarks that are built via -DMATX_BUILD_BENCHMARKS=ON. Also add bench/scripts/run_fltflt_benchmarks.py, which runs the fltflt benchmarks and summarizes the results. Example results when running on an RTX PRO 6000 Blackwell Server Edition are as follows:

  Performance relative to single-precision (float = 1.0x baseline)
  Higher values indicate slower performance

  Benchmark      float     double     fltflt    fltflt vs dbl
  ------------------------------------------------------------------
  add            1.00x     71.10x     28.84x         2.47x
  sub            1.00x     71.11x     28.85x         2.46x
  mul            1.00x     71.17x     10.15x         7.01x
  div            1.00x     52.63x      5.85x         8.99x
  sqrt           1.00x     52.40x      3.89x        13.48x
  abs            1.00x      2.17x      2.15x         1.01x
  fma            1.00x     71.13x     25.36x         2.81x
  madd           1.00x     71.14x     38.78x         1.83x
  ------------------------------------------------------------------

  Note that addition and subtraction are only ~2.5x faster using fltflt than fp64. Multiplication, division, and square root are significantly faster. Future updates may improve addition performance, but potentially at an accuracy cost, so the changes will likely be opt-in.

  Signed-off-by: Thomas Benson <tbenson@nvidia.com>

* Add guards in run_fltflt_benchmarks.py parsing

  Signed-off-by: Thomas Benson <tbenson@nvidia.com>

---------

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
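The "fltflt vs dbl" column in the example results appears to be the double column divided by the fltflt column, i.e. how many times faster fltflt is than fp64. The following is a minimal sketch of that relationship using the rounded values from the commit message. It is not the code in run_fltflt_benchmarks.py; the names `RESULTS` and `fltflt_speedup_vs_double` are hypothetical and exist only for this illustration.

```python
# Slowdown ratios copied from the example results in the commit message
# (float = 1.00x baseline; higher means slower).
# Each entry: benchmark -> (double, fltflt, reported "fltflt vs dbl").
RESULTS = {
    "add":  (71.10, 28.84, 2.47),
    "sub":  (71.11, 28.85, 2.46),
    "mul":  (71.17, 10.15, 7.01),
    "div":  (52.63, 5.85, 8.99),
    "sqrt": (52.40, 3.89, 13.48),
    "abs":  (2.17, 2.15, 1.01),
    "fma":  (71.13, 25.36, 2.81),
    "madd": (71.14, 38.78, 1.83),
}


def fltflt_speedup_vs_double(double_ratio: float, fltflt_ratio: float) -> float:
    """Return how many times faster fltflt is than double.

    Both inputs are slowdowns relative to float, so their quotient is the
    speedup of fltflt over double. (Hypothetical helper, not from the script.)
    """
    return double_ratio / fltflt_ratio


if __name__ == "__main__":
    for name, (dbl, ff, reported) in RESULTS.items():
        recomputed = fltflt_speedup_vs_double(dbl, ff)
        print(f"{name:<6} recomputed {recomputed:6.2f}x   reported {reported:6.2f}x")
```

Because the inputs here are already rounded to two decimals, the recomputed values can differ from the reported column in the last digit.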

1 parent b221325 commit f7c1a6d

4 files changed

+883

-0

lines changed

bench
- 00_misc
  - fltflt_arithmetic.cu
- CMakeLists.txt
- scripts
  - run_fltflt_benchmarks.py
include/matx/kernels
- fltflt.h

