Description
Unless doing instruction-level profiling, the highest-precision timer on a modern computer has a resolution of about 25-32 cycles (which is, under ideal circumstances, about 6ns on a 5GHz processor at best and 32ns on a 1GHz processor).
Due to platform-specific differences, the maximum reported difference in the high-precision timer APIs exposed by the OS is about 100ns. Additionally, it is well documented that, due to the latency between calls and other factors in the OS or hardware, the latency of such a call can be much worse, closer to 300ns when a CPU-level timer such as `RDTSC` is not available: https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps#resolution-precision-accuracy-and-stability.
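A quick way to get a feel for these numbers on a given machine is to read the clock back-to-back and record the smallest nonzero step it reports. The sketch below uses Python's `time.perf_counter_ns` purely for illustration; the helper name `smallest_timer_step_ns` is made up here, and the result includes interpreter and call overhead, so treat it as an upper bound on what the OS timer can resolve rather than as a precise figure.

```python
import time


def smallest_timer_step_ns(samples: int = 100_000) -> int:
    """Return the smallest nonzero gap seen between back-to-back timer reads.

    Hypothetical helper for illustration; the value includes the cost of the
    timer call itself plus interpreter overhead.
    """
    best = None
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        # If both reads landed on the same tick, spin until the clock advances.
        while t1 == t0:
            t1 = time.perf_counter_ns()
        delta = t1 - t0
        if best is None or delta < best:
            best = delta
    return best


print(f"smallest observable step: {smallest_timer_step_ns()} ns")
```

Any benchmark whose reported mean is well below this step can only have been derived by averaging over many invocations, which is exactly where per-call overhead and noise start to dominate.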
While Benchmark.NET does try to account for small methods and while it also tries to account for noise due to call overhead and the like, there are many cases where the numbers it reports are of questionable accuracy.
One such example is the following:
In particular, if we look at the first entry, `GetShortName_opt` is reporting a time of 0.2082 ns. Even in an "ideal" scenario where the JIT is able to fully optimize the comparison against a constant value down to simply `xor rax, rax`, this is still reporting that it takes approximately 1 cycle on a 5GHz CPU.
- It also shouldn't be able to optimize it like this. AFAIR, Benchmark.NET should be passing the value in and preventing the actual benchmark body from being inlined to avoid such issues.
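For reference, the cycle estimate is plain arithmetic: 1 GHz is one cycle per nanosecond, so cycles = time in ns × clock in GHz. A minimal sketch (Python is used only for the arithmetic, and the clock speeds other than 5GHz are just sample values, not measurements):

```python
def ns_to_cycles(time_ns: float, clock_ghz: float) -> float:
    """Convert a duration in nanoseconds to CPU cycles at a given clock."""
    # 1 GHz == 1 cycle per nanosecond, so the conversion is a single multiply.
    return time_ns * clock_ghz


reported_ns = 0.2082  # the mean reported for GetShortName_opt

# Sample clock speeds; only the 5 GHz case is discussed in the text.
for ghz in (1.0, 2.0, 4.0, 5.0):
    print(f"{ghz:.0f} GHz: {ns_to_cycles(reported_ns, ghz):.2f} cycles")
```

At 5GHz this works out to roughly 1.04 cycles, and on slower clocks it drops below a single cycle for the entire method call.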
It would be beneficial, IMO, if Benchmark.NET was more proactive about labeling potentially problematic results and had guidance on how to optimally write a test in a way that will provide accurate results.
- I would view a problematic result, at the very least, as anything taking less than 10ns. Most of these methods should be testing more than a single instruction and are running on 2-4GHz computers. So in an "ideal" environment, 10ns represents no more than 20 instructions and likely no memory accesses. Very few instructions take 0 cycles. Several take 1 cycle and can be pipelined so that up to 4 are in simultaneous dispatch, but it's rare to actually achieve this. Many take 2-3 cycles, and any kind of memory access will take about 3-11 cycles in the fastest scenario (potentially longer for uncached results, among other things).
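To make the proposal concrete, here is a rough sketch of the kind of check I have in mind. It is not Benchmark.NET code and none of these names exist in the library: `SUSPICIOUS_THRESHOLD_NS`, `cycle_budget`, and `flag_suspicious` are hypothetical, the 10ns cutoff is just the threshold suggested above, and the second result in the sample data is invented purely to show a non-flagged entry.

```python
SUSPICIOUS_THRESHOLD_NS = 10.0  # cutoff suggested above; not a Benchmark.NET setting


def cycle_budget(time_ns: float, clock_ghz: float) -> float:
    """Number of CPU cycles that fit in time_ns at the given clock speed."""
    return time_ns * clock_ghz


def flag_suspicious(results: dict, threshold_ns: float = SUSPICIOUS_THRESHOLD_NS) -> list:
    """Return the names of benchmarks whose mean time is below the threshold."""
    return [name for name, mean_ns in results.items() if mean_ns < threshold_ns]


results = {
    "GetShortName_opt": 0.2082,      # mean reported in the example above
    "HypotheticalSlowMethod": 125.0,  # invented value, for illustration only
}

for ghz in (2.0, 4.0):
    budget = cycle_budget(SUSPICIOUS_THRESHOLD_NS, ghz)
    print(f"{SUSPICIOUS_THRESHOLD_NS:.0f} ns at {ghz:.0f} GHz = {budget:.0f} cycles")

for name in flag_suspicious(results):
    print(f"warning: {name} reports {results[name]} ns, below the timer-noise threshold")
```

A fixed cutoff is the simplest option; a threshold derived from the measured timer resolution of the machine running the benchmark would likely be more robust, but that is a design decision for the maintainers.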