Description
During the warmup phase, the native benchmarks executor estimates the number of repetitions for each measurement iteration (i.e. the number of times the inner loop should invoke a benchmarked function).
Unfortunately, for short-running benchmarks, the machinery required for the warmup adds overhead that inflates the average time per benchmark invocation. As a result, the estimated repetition count ends up too low, and the executor spends significantly less time in each measurement iteration than the user configured.
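Roughly, the estimation could look like the following sketch (this is not the actual kotlinx-benchmark code; the function and its signature are made up for illustration):

```kotlin
import kotlin.time.Duration
import kotlin.time.TimeSource

// Simplified sketch of how the repetition count for a measurement iteration
// could be derived from warmup timings; NOT the actual executor implementation.
fun estimateRepetitions(iterationTime: Duration, benchmark: () -> Unit): Long {
    var invocations = 0L
    var elapsedNs = 0L
    while (elapsedNs < iterationTime.inWholeNanoseconds) {
        // The clock is read around every single call, so for a nanosecond-scale
        // benchmark the timestamp overhead dominates the measured time.
        val mark = TimeSource.Monotonic.markNow()
        benchmark()
        elapsedNs += mark.elapsedNow().inWholeNanoseconds
        invocations++
    }
    val avgNsPerOp = elapsedNs.toDouble() / invocations
    // Repetitions = how many calls fit into the target iteration time,
    // assuming each call really takes avgNsPerOp nanoseconds.
    return (iterationTime.inWholeNanoseconds / avgNsPerOp).toLong()
}
```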
Consider the following example (https://github.com/fzhinkin/kt-64361-benchmarks/blob/main/kmp-benchmarks/src/commonMain/kotlin/SignumBenchmarks.kt):
… org.example.LongSignumBenchmark.signBitExtractingSignum
Warm-up #0: 24.6405 ns/op
Warm-up #1: 24.7149 ns/op
Warm-up #2: 24.7893 ns/op
Warm-up #3: 24.6666 ns/op
Warm-up #4: 24.6982 ns/op
Iteration #0: 3.97785 ns/op
Iteration #1: 3.99065 ns/op
Iteration #2: 3.95236 ns/op
...
As you can see, the average time reported during warmup is about 6 times higher than what was measured later.
This is hard to convey in textual form, but if you run the benchmark locally and watch the log messages being printed, you'll notice that each measurement iteration takes significantly less time than a warmup iteration.
In my case, the duration of a single iteration was set to 1 second. During warmup, each iteration indeed took about that long, but once measurement started, each iteration finished almost instantly.
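To make the effect concrete, assuming the repetition count is derived as the target iteration time divided by the warmup average (an assumption about the estimation formula): with a 1 s iteration and the 24.6 ns/op reported during warmup, the executor would pick roughly 1e9 / 24.6 ≈ 40 million repetitions. At the real cost of about 3.97 ns/op, those repetitions complete in roughly 40e6 × 3.97 ns ≈ 160 ms, i.e. each measurement iteration runs for only about a sixth of the configured second.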
The problem here is that the warmup phase queries the current timestamp after every single benchmark method call. For long-running benchmarks the difference is hard to notice, but for short-running benchmarks this adds significant overhead.
Since it may not be possible on all platforms to run the warmup on a separate thread that could be interrupted by a timer (or could we simply create a pthread?), perhaps the procedure could be adjusted to check the elapsed time only every N benchmark calls?
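One possible shape of such an adjustment (a sketch only; the batch size and its value are arbitrary assumptions, not a concrete proposal for the executor's internals):

```kotlin
import kotlin.time.Duration
import kotlin.time.TimeSource

// Sketch of a warmup loop that reads the clock only once per batch of invocations,
// amortizing the timestamp overhead; batchSize is an illustrative constant.
fun warmupWithBatchedTimeChecks(
    warmupTime: Duration,
    batchSize: Int = 1024,
    benchmark: () -> Unit
): Double {
    val start = TimeSource.Monotonic.markNow()
    var invocations = 0L
    while (start.elapsedNow() < warmupTime) {
        // Run a whole batch between clock reads instead of timing every call.
        repeat(batchSize) { benchmark() }
        invocations += batchSize
    }
    // The per-invocation overhead of reading the clock is now divided by batchSize,
    // so the resulting average should be much closer to the true ns/op.
    return start.elapsedNow().inWholeNanoseconds.toDouble() / invocations
}
```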