# Simple Benchmark

## Network Benchmark without BatchNorm (F32/F16) on an RTX 3080 Laptop GPU (150W)

Network code: `test/benchmark.py`. A minimal usage sketch follows the tables below.

| F32/F16  | Spconv 1.x F32 (1080Ti) | Native        | Implicit Gemm | Implicit Gemm Split Mask |
|----------|-------------------------|---------------|---------------|--------------------------|
| Forward  | 43ms                    | 21.7ms/13.7ms | 23.5ms/11.2ms | 22ms/12.2ms              |
| Backward | 80ms                    | 41.9ms/25.2ms | 51.0ms/13.8ms | 41.1ms/12.2ms            |

| F16 Forward          | Native | Implicit Gemm | Implicit Gemm Split Mask |
|----------------------|--------|---------------|--------------------------|
| RTX 3080 Laptop 150W | 13.7ms | 11.2ms        | 12.2ms                   |
| RTX A6000            | 19.1ms | 11.7ms        | 14.0ms                   |
| TESLA V100           | 17.9ms | 11.4ms        | 13.4ms                   |

| F16 Backward         | Native | Implicit Gemm | Implicit Gemm Split Mask |
|----------------------|--------|---------------|--------------------------|
| RTX 3080 Laptop 150W | 25.2ms | 13.8ms        | 12.2ms                   |
| RTX A6000            | 28.1ms | 9.2ms         | 8.9ms                    |
| TESLA V100           | 33.9ms | 12.2ms        | 12.9ms                   |
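
The Native / Implicit Gemm / Implicit Gemm Split Mask columns correspond to the algorithm choices exposed by spconv 2.x. The sketch below is not the actual `test/benchmark.py`; it only illustrates, assuming spconv 2.x's `ConvAlgo` enum and the `algo` keyword on conv layers, how a layer can be pinned to one of the three algorithms. Channel sizes, spatial shape, and voxel counts here are made up for illustration.

```python
import torch
import spconv.pytorch as spconv
from spconv.core import ConvAlgo  # Native, MaskImplicitGemm, MaskSplitImplicitGemm

# Hypothetical two-layer network; the real benchmark network lives in
# test/benchmark.py. algo= pins each layer to one of the three algorithms
# compared in the tables above.
net = spconv.SparseSequential(
    spconv.SubMConv3d(32, 64, 3, algo=ConvAlgo.MaskImplicitGemm),
    spconv.SparseConv3d(64, 64, 3, stride=2, algo=ConvAlgo.MaskImplicitGemm),
).cuda().half()  # the F16 rows run the whole network in half precision

# Random unique voxels as input; real benchmarks use point-cloud voxels.
coords = torch.unique(torch.randint(0, 64, (8000, 3), dtype=torch.int32), dim=0)
batch_idx = torch.zeros((coords.shape[0], 1), dtype=torch.int32)
indices = torch.cat([batch_idx, coords], dim=1).cuda()  # (N, 4): batch, z, y, x
features = torch.randn(indices.shape[0], 32, dtype=torch.half, device="cuda")
x = spconv.SparseConvTensor(features, indices, spatial_shape=[64, 64, 64], batch_size=1)
out = net(x)  # out is a SparseConvTensor; out.features is the (M, 64) feature matrix
```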

## Network GEMM Kernel Benchmark (FP16) on an RTX 3080 Laptop GPU

Network code: `test/benchmark.py`

The network, input, and profiling code are the same as for the tables above.

This table profiles only the FP16 GEMM kernels, excluding the overhead of creating and clearing the output tensor; it shows the performance upper bound of our algorithm.

| F16     | Native | Implicit Gemm | Implicit Gemm Split Mask |
|---------|--------|---------------|--------------------------|
| Forward | 8.0ms  | 4.3ms         | 4.0ms                    |
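
Timings like these are what you get from CUDA-event timing around the forward and backward passes. Below is a minimal sketch of that measurement pattern in plain PyTorch; it is not the exact code in `test/benchmark.py`, and `net`/`x`/`features` refer to the hypothetical objects from the sketch above.

```python
import torch

def time_cuda(fn, warmup=10, iters=50):
    """Return the mean milliseconds per call of a CUDA workload."""
    for _ in range(warmup):  # warm up kernel caches and autotuner heuristics
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels before reading the timer
    return start.elapsed_time(end) / iters

# Forward only:
# fwd_ms = time_cuda(lambda: net(x))
# Forward + backward: make the input features require grad first, e.g.
# features.requires_grad_(); each call builds and frees a fresh graph:
# fb_ms = time_cuda(lambda: net(x).features.sum().backward())
```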

We can see that implicit GEMM is very fast: the GEMM kernels account for only 4.3ms of the 11.2ms network forward time. Better performance should be achievable with TensorRT + pure C++.

**NOTE**: When benchmarking a network on your laptop, don't forget to close all apps except your terminal! Other apps consume GPU resources and make kernels run slower.

## Comparison with MinkowskiEngine and torchsparse

TODO