Network Code: `test/benchmark.py`
Cell values such as `21.7ms/13.7ms` are F32 time / F16 time (a minimal timing sketch follows the tables below).
F32/F16 | Spconv 1.x F32 (1080Ti) | Native | Implicit Gemm | Implicit Gemm Split Mask |
---|---|---|---|---|
Forward | 43ms | 21.7ms/13.7ms | 23.5ms/11.2ms | 22ms/12.2ms |
Backward | 80ms | 41.9ms/25.2ms | 51.0ms/13.8ms | 41.1ms/12.2ms |
F16 Forward | Native | Implicit Gemm | Implicit Gemm Split Mask |
---|---|---|---|
RTX 3080 Laptop 150W | 13.7ms | 11.2ms | 12.2ms |
RTX A6000 | 19.1ms | 11.7ms | 14.0ms |
TESLA V100 | 17.9ms | 11.4ms | 13.4ms |
F16 Backward | Native | Implicit Gemm | Implicit Gemm Split Mask |
---|---|---|---|
RTX 3080 Laptop 150W | 25.2ms | 13.8ms | 12.2ms |
RTX A6000 | 28.1ms | 9.2ms | 8.9ms |
TESLA V100 | 33.9ms | 12.2ms | 12.9ms |
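To give a concrete idea of how such timings can be collected, here is a minimal, hedged sketch using the spconv 2.x API and CUDA events. The layer sizes, point count, and spatial shape below are illustrative placeholders, not the configuration used by `test/benchmark.py`; the `algo` argument selects between the three algorithms compared in the tables.

```python
# Hedged timing sketch (assumes spconv 2.x API; sizes are illustrative only).
import torch
import spconv.pytorch as spconv

@torch.no_grad()
def bench_forward(net, x, iters=100, warmup=10):
    # Warm up so one-time setup (kernel selection, caching) is excluded.
    for _ in range(warmup):
        net(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        net(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean milliseconds per forward

# Build a random sparse input: unique (batch_idx, z, y, x) int32 coordinates.
n, shape = 100000, [200, 200, 200]
coords = torch.stack(
    [torch.zeros(n, dtype=torch.int32)]
    + [torch.randint(0, s, (n,), dtype=torch.int32) for s in shape], dim=1)
coords = torch.unique(coords, dim=0).cuda()  # deduplicate voxel coordinates
feats = torch.randn(coords.shape[0], 32, device="cuda", dtype=torch.float16)
x = spconv.SparseConvTensor(feats, coords, shape, batch_size=1)

# `algo` picks the algorithm: ConvAlgo.Native, ConvAlgo.MaskImplicitGemm,
# or ConvAlgo.MaskSplitImplicitGemm (the three columns above).
net = spconv.SparseSequential(
    spconv.SubMConv3d(32, 64, 3, algo=spconv.ConvAlgo.MaskImplicitGemm),
).cuda().half()
print(f"{bench_forward(net, x):.2f} ms / forward")
```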
Network Code: `test/benchmark.py`
The network/input/profile code is the same as for the tables above.
This table profiles only the FP16 GEMM kernels, excluding the overhead of creating and clearing output tensors; it shows the performance upper bound of our algorithm (a profiling sketch follows the table).
F16 | Native | Implicit Gemm | Implicit Gemm Split Mask |
---|---|---|---|
Forward | 8.0ms | 4.3ms | 4.0ms |
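As a rough illustration of measuring kernel time in isolation (a sketch only, not necessarily how `test/benchmark.py` measures it), one can run the forward pass under `torch.profiler` and sum just the CUDA time of kernels whose names contain "gemm"; this excludes host-side tensor creation/clear overhead. The "gemm" substring match is a heuristic assumption about kernel naming.

```python
# Hedged sketch: sum CUDA time of gemm kernels only, via torch.profiler.
# `net` and `x` are the illustrative objects from the sketch above; the
# "gemm" substring match is a heuristic for implicit-gemm kernel names.
import torch
from torch.profiler import ProfilerActivity, profile

with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    net(x)
gemm_us = sum(e.cuda_time_total for e in prof.key_averages()
              if "gemm" in e.key.lower())  # times are in microseconds
print(f"gemm kernels: {gemm_us / 1000:.3f} ms")
```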
We can see that implicit GEMM is very fast: the GEMM kernels account for only 4.3ms of the 11.2ms network forward time. We can achieve even better performance with TensorRT + pure C++.
NOTE: When benchmarking a network on your laptop, don't forget to close all apps except terminals! Other apps consume GPU resources and make kernels run slower.
Comparison with MinkowskiEngine and torchsparse
TODO