Comparison with Kryo IO #1

Neiko2002 · 2024-10-31T14:21:53Z

How does the performance of this library compare to the IO classes of Kryo?

Like
https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/io/ByteBufferInput.java

or the older fast Inputs:
https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/unsafe/UnsafeByteBufferInput.java

szeiger · 2024-11-23T01:05:53Z

I haven't run any benchmarks against Kryo and hadn't looked at it before, but looking through ByteBufferInput now, I see several patterns that I avoided for performance reasons. Someone should benchmark it :-)

I expect it to be much faster than DataInputStream/BufferedInputStream but still significantly slower than perfio.BufferedInput (within a factor of 2).

I did run some tests with unsafe array access for perfIO but the results were not compelling. This is certainly worth re-examining for various use cases. With tricks in the range checks to elide unnecessary double checks in many cases, and the VarHandle based array accessor, the results so far didn't justify additional unsafe code paths.

szeiger · 2024-11-23T22:02:14Z

I've added Kryo to BufferedInputNumBenchmark and BufferedOutputNumBenchmark. Even the unsafe version turned out to be more than a factor of 2 slower than perfIO.

[info] Benchmark                                                Mode  Cnt    Score    Error  Units
[info] BufferedInputNumBenchmark.array_BufferedInput_fromArray  avgt    7   14.915 ±  0.766  ms/op
[info] BufferedInputNumBenchmark.array_ByteBuffer               avgt    7   20.985 ±  0.858  ms/op
[info] BufferedInputNumBenchmark.array_DataInputStream          avgt    7  767.910 ±  1.186  ms/op
[info] BufferedInputNumBenchmark.array_Kryo                     avgt    7  139.057 ± 22.666  ms/op
[info] BufferedInputNumBenchmark.array_KryoUnsafe               avgt    7   62.688 ±  9.583  ms/op

[info] Benchmark                                                                  Mode  Cnt    Score    Error  Units
[info] BufferedOutputNumBenchmark.array_ByteBuffer                                avgt   15   31.351 ±  0.219  ms/op
[info] BufferedOutputNumBenchmark.array_DataOutputStream_growing                  avgt   15  836.983 ± 11.127  ms/op
[info] BufferedOutputNumBenchmark.array_DataOutputStream_preallocated             avgt   15  797.732 ±  0.997  ms/op
[info] BufferedOutputNumBenchmark.array_FullyBufferedOutput_growing               avgt   15   76.946 ±  7.284  ms/op
[info] BufferedOutputNumBenchmark.array_FullyBufferedOutput_growing_preallocated  avgt   15   28.206 ±  0.145  ms/op
[info] BufferedOutputNumBenchmark.array_KryoUnsafe_growing                        avgt   15  115.339 ±  1.080  ms/op
[info] BufferedOutputNumBenchmark.array_KryoUnsafe_preallocated                   avgt   15   61.774 ±  4.416  ms/op
[info] BufferedOutputNumBenchmark.array_Kryo_growing                              avgt   15  176.769 ±  0.834  ms/op
[info] BufferedOutputNumBenchmark.array_Kryo_preallocated                         avgt   15  129.308 ±  0.568  ms/op

I also ran some more tests with Unsafe for both input and output, but it's slower than the current safe VarHandle-based access.

szeiger · 2024-11-23T23:30:24Z

Oops, I didn't realize that the regular Input and Output aren't abstract. The previous data was for the ByteBuffer-based implementations. This includes the array-based ones:

[info] Benchmark                                                Mode  Cnt    Score    Error  Units
[info] BufferedInputNumBenchmark.array_BufferedInput_fromArray  avgt    7   15.377 ±  0.985  ms/op
[info] BufferedInputNumBenchmark.array_Kryo                     avgt    7   40.190 ±  0.479  ms/op
[info] BufferedInputNumBenchmark.array_KryoBB                   avgt    7  138.786 ± 23.067  ms/op
[info] BufferedInputNumBenchmark.array_KryoBBUnsafe             avgt    7   63.079 ± 14.864  ms/op
[info] BufferedInputNumBenchmark.array_KryoUnsafe               avgt    7   19.830 ±  0.290  ms/op

[info] Benchmark                                                                  Mode  Cnt    Score   Error  Units
[info] BufferedOutputNumBenchmark.array_FullyBufferedOutput_growing               avgt   15   77.859 ± 7.204  ms/op
[info] BufferedOutputNumBenchmark.array_FullyBufferedOutput_growing_preallocated  avgt   15   28.188 ± 0.162  ms/op
[info] BufferedOutputNumBenchmark.array_KryoBBUnsafe_growing                      avgt   15  116.250 ± 2.286  ms/op
[info] BufferedOutputNumBenchmark.array_KryoBBUnsafe_preallocated                 avgt   15   61.142 ± 4.556  ms/op
[info] BufferedOutputNumBenchmark.array_KryoBB_growing                            avgt   15  177.412 ± 3.084  ms/op
[info] BufferedOutputNumBenchmark.array_KryoBB_preallocated                       avgt   15  129.136 ± 0.256  ms/op
[info] BufferedOutputNumBenchmark.array_KryoUnsafe_growing                        avgt   15   74.337 ± 7.044  ms/op
[info] BufferedOutputNumBenchmark.array_KryoUnsafe_preallocated                   avgt   15   50.421 ± 0.975  ms/op
[info] BufferedOutputNumBenchmark.array_Kryo_growing                              avgt   15  105.148 ± 3.687  ms/op
[info] BufferedOutputNumBenchmark.array_Kryo_preallocated                         avgt   15   45.226 ± 0.550  ms/op

This matches my original expectation: Kryo is significantly slower, but within a factor of 2.

szeiger · 2024-11-24T00:03:51Z

Huh, the growing output is actually a bit faster with Kryo's unsafe implementation. (The numbers above aren't great; it takes too long for HotSpot optimization to stabilize, but more warmup leads to the same results.) This is unexpected since the flushing implementation is not overridden in UnsafeOutput and the preallocated UnsafeOutput version is actually slower than the safe one. And the difference is even larger when flushing to a FileOutputStream. I suppose I need to run some more experiments with Unsafe specifically for these use cases. The previous tests were all writing to a preallocated buffer (because that produces the most consistent results).

szeiger · 2024-11-24T02:27:00Z

I'm reopening this ticket since there's some opportunity for performance improvement based on comparison with Kryo's UnsafeOutput. Kryo doesn't have any of the block management, support for growing and nesting outputs, etc., so the implementation ends up being much smaller and simpler, but that shouldn't be too much of an issue if we can ensure that the hot paths get inlined and the cold ones stay out of the way.

I've identified several factors that result in lower performance compared to Kryo's UnsafeOutput when flushing to a file:

Nested block flushing is a lot of code and not necessary when there are no nested blocks. Factoring it out into a separate method leads to better inlining.
UnsafeOutput always uses the native byte order. The benchmarks make perfIO use big endian which is naturally slower on a little-endian machine. In addition, perfIO has to check endianness on every read/write, and HotSpot can't fully eliminate that. This gives Kryo and DataOutputStream an unfair advantage. I've considered endian-specific read/write methods before. Maybe it's time to add them because they really pay off in this benchmark. The alternative would be endian-specific classes (so you can't switch dynamically) but that would complicate the API a lot.
Unsafe access itself helps in these benchmarks, so an optional unsafe code path would be useful. I think this can be integrated similar to StringUtils via load-time lookup so it can be fully inlined for zero overhead.
The "clever" range checks in fwd() contributed to not getting a benefit from unsafe access in previous benchmarks because the automatic range check could be optimized away. The manual check is more expensive though, so it shouldn't be used in unsafe mode.
At least in unsafe mode, an optimized int8() without count and pre-increment is faster.
Normal array access may be faster than unsafe int8() (but this needs more benchmarking to check how it interacts with range check elimination in different scenarios).

Neiko2002 · 2024-11-24T16:14:30Z

I benchmarked KryoUnsafe on Java 8 several years ago and found it to be quite fast. It's great to see that it remains competitive, and even better to know there's still room for improvement. perfio is already the fastest and will become even faster with these changes.

szeiger · 2024-11-24T17:35:24Z

My assessment last night may have been too optimistic. I got a stripped-down version of BufferedOutput down to the same speed as Kryo's UnsafeOutput in this specific benchmark (where it was slower than Kryo before), but I'm not having much luck doing it with the full-featured version so far. Using Unsafe does result in a higher speed on this micro-benchmark, but the other changes are not helping. They probably rely on inlining of the outer buffering loop that we're not getting with the full version.

In particular, the clever range check is still much better overall than the direct one, even with unsafe array access. Only a minimized micro-benchmark showed a better performance with the direct version.

szeiger closed this as completed Nov 23, 2024

szeiger reopened this Nov 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparison with Kryo IO #1

Comparison with Kryo IO #1

Neiko2002 commented Oct 31, 2024

szeiger commented Nov 23, 2024

szeiger commented Nov 23, 2024

szeiger commented Nov 23, 2024

szeiger commented Nov 24, 2024

szeiger commented Nov 24, 2024

Neiko2002 commented Nov 24, 2024

szeiger commented Nov 24, 2024

Comparison with Kryo IO #1

Comparison with Kryo IO #1

Comments

Neiko2002 commented Oct 31, 2024

szeiger commented Nov 23, 2024

szeiger commented Nov 23, 2024

szeiger commented Nov 23, 2024

szeiger commented Nov 24, 2024

szeiger commented Nov 24, 2024

Neiko2002 commented Nov 24, 2024

szeiger commented Nov 24, 2024