@samyron
Contributor

@samyron samyron commented Jul 8, 2025

Overview

This PR uses the jdk.incubator.vector module as mentioned in issue #739 to accelerate generating JSON with the same algorithm as the C extension.

As it stands, the PR attempts to build the json.ext.VectorizedEscapeScanner class with a target release of 16, the first version of Java that supports the jdk.incubator.vector module. The remaining code is built for Java 1.8. At runtime, json.ext.VectorizedEscapeScanner is loaded only if the json.enableVectorizedEscapeScanner system property is set to true (or 1).
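
As a rough illustration of the opt-in loading described above, here is a minimal sketch. It is not the PR's code: EscapeScannerSketch, ScalarScannerSketch and ScannerLoaderSketch are hypothetical names, and only the property name and the vectorized class name come from the description. The point is that any failure to load the class (missing class, older JDK, module not added) should fall back quietly to the scalar path.

```java
import java.lang.reflect.Constructor;

final class ScannerLoaderSketch {
    // Property name taken from the PR description; "true" or "1" enables the vectorized path.
    static final String PROPERTY = "json.enableVectorizedEscapeScanner";

    static boolean enabled(String value) {
        return "true".equalsIgnoreCase(value) || "1".equals(value);
    }

    /** Tries to load className reflectively; returns the scalar scanner on any failure. */
    static EscapeScannerSketch load(String propertyValue, String className) {
        if (!enabled(propertyValue)) {
            return new ScalarScannerSketch();
        }
        try {
            Class<?> cls = ScannerLoaderSketch.class.getClassLoader().loadClass(className);
            Constructor<?> ctor = cls.getDeclaredConstructor();
            return (EscapeScannerSketch) ctor.newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Class missing, class file too new for this JVM, or jdk.incubator.vector not added.
            return new ScalarScannerSketch();
        }
    }

    public static void main(String[] args) {
        EscapeScannerSketch s = load(System.getProperty(PROPERTY), "json.ext.VectorizedEscapeScanner");
        System.out.println("Using " + s.name() + " scanner");
    }
}

/** Hypothetical stand-in for the scanner interface; the real type lives in json.ext. */
interface EscapeScannerSketch {
    String name();
}

/** Scalar fallback that is always available. */
final class ScalarScannerSketch implements EscapeScannerSketch {
    public String name() {
        return "scalar";
    }
}
```

Catching LinkageError as well as the reflective exceptions is a deliberate choice in this sketch: a class compiled for release 16 fails with UnsupportedClassVersionError on an older JVM, which is not a ReflectiveOperationException.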

I'm not entirely sure how this is packaged / included with JRuby, so I'd love @byroot's and @headius's (and others'?) thoughts on how to package and/or structure the JARs. I did consider putting json.ext.VectorizedEscapeScanner in a separate generator-vectorized.jar, but I thought I'd solicit feedback before spending any more time on the build / packaging process.

Benchmarks

Machine: M1 MacBook Air

Note: I've had trouble modifying the compare.rb I was using for the C extension to work reliably with the Java extension. I'll probably spend more time trying to get it to work, but as of right now these are pretty raw benchmarks.

Below are two sample runs of the real-world benchmarks. The benchmarks are much more variable than with the C extension for some reason. I'm not sure if HotSpot is doing something slightly different per execution.

Vector API Enabled

scott@Scotts-MacBook-Air json % ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=true' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.384k i/100ms
Calculating -------------------------------------
                json     15.289k (± 0.8%) i/s   (65.41 μs/i) -    153.624k in  10.048481s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    76.000 i/100ms
Calculating -------------------------------------
                json    753.787 (± 3.6%) i/s    (1.33 ms/i) -      7.524k in   9.997059s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   173.000 i/100ms
Calculating -------------------------------------
                json      1.751k (± 1.1%) i/s  (571.24 μs/i) -     17.646k in  10.081260s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.390k i/100ms
Calculating -------------------------------------
                json     23.829k (± 0.8%) i/s   (41.97 μs/i) -    239.000k in  10.030503s

Vector API Disabled

scott@Scotts-MacBook-Air json % ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=false' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
VectorizedEscapeScanner disabled.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.204k i/100ms
Calculating -------------------------------------
                json     12.937k (± 1.1%) i/s   (77.30 μs/i) -    130.032k in  10.052234s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    80.000 i/100ms
Calculating -------------------------------------
                json    817.378 (± 1.0%) i/s    (1.22 ms/i) -      8.240k in  10.082058s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   147.000 i/100ms
Calculating -------------------------------------
                json      1.499k (± 1.3%) i/s  (667.08 μs/i) -     14.994k in  10.004181s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.269k i/100ms
Calculating -------------------------------------
                json     22.366k (± 5.7%) i/s   (44.71 μs/i) -    224.631k in  10.097069s

master as of commit c5af1b68c582335c2a82bbc4bfa5b3e41ead1eba

scott@Scotts-MacBook-Air json % ONLY=json ruby -I"lib" benchmark/encoder-realworld.rb
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   886.000 i/100ms
Calculating -------------------------------------
                json^C%                                                                                                                   
scott@Scotts-MacBook-Air json % ONLY=json ruby -I"lib" benchmark/encoder-realworld.rb
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.031k i/100ms
Calculating -------------------------------------
                json     10.812k (± 1.3%) i/s   (92.49 μs/i) -    108.255k in  10.014260s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    82.000 i/100ms
Calculating -------------------------------------
                json    824.921 (± 1.0%) i/s    (1.21 ms/i) -      8.282k in  10.040787s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   141.000 i/100ms
Calculating -------------------------------------
                json      1.421k (± 0.7%) i/s  (703.85 μs/i) -     14.241k in  10.023979s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.274k i/100ms
Calculating -------------------------------------
                json     22.612k (± 0.9%) i/s   (44.22 μs/i) -    227.400k in  10.057516s

Observations

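activitypub.json and twitter.json seem to be consistently faster with the Vector API enabled (about 18% and 17% faster than the disabled runs above: 15.289k vs. 12.937k i/s and 1.751k vs. 1.499k i/s). citm_catalog.json seems consistently a bit slower (about 8% in the runs above: 753.787 vs. 817.378 i/s), and ohai.json is fairly close to even.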

@samyron samyron force-pushed the sm/java-vector-simd branch from 194ba01 to 15c7187 on July 15, 2025 at 03:12
@samyron
Contributor Author

samyron commented Jul 15, 2025

Using hsdis to examine the generated assembly, I can verify that on my MacBook Air the HotSpot C2 compiler does indeed use NEON instructions.

ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=true -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintIntrinsics -XX:CompileCommand=print,*VectorizedEscapeScanner.*' ruby -I"lib" benchmark/encoder-realworld.rb > output.txt 2>&1
Compiled method (c2)   22086 5801       4       json.ext.VectorizedEscapeScanner::scan (391 bytes)
<snip>

[Disassembly]
--------------------------------------------------------------------------------
[Constant Pool (empty)]

--------------------------------------------------------------------------------

[Entry Point]
  # {method} {0x0000000133c3a0d8} 'scan' '(Ljson/ext/EscapeScanner$State;)Z' in 'json/ext/VectorizedEscapeScanner'
  # this:     c_rarg1:c_rarg1 
                        = 'json/ext/VectorizedEscapeScanner'
  # parm0:    c_rarg2:c_rarg2 
                        = 'json/ext/EscapeScanner$State'
  #           [sp+0x30]  (sp of caller)
  0x000000011b28d0c0:   ldr		w8, [x1, #8]
  0x000000011b28d0c4:   cmp		w9, w8
  0x000000011b28d0c8:   b.eq		#0x11b28d0d0
  0x000000011b28d0cc:   b		#0x11aa5fe80        ;   {runtime_call ic_miss_stub}
[Verified Entry Point]
  0x000000011b28d0d0:   nop		
  0x000000011b28d0d4:   sub		x9, sp, #0x14, lsl #12
  0x000000011b28d0d8:   str		xzr, [x9]
  0x000000011b28d0dc:   sub		sp, sp, #0x30
 <snip>
  0x000000011b28d194:   add		x12, x5, w14, sxtw
  0x000000011b28d198:   ldr		q20, [x12, #0x10]
  0x000000011b28d19c:   eor		v21.16b, v20.16b, v17.16b
  0x000000011b28d1a0:   cmgt		v22.16b, v19.16b, v20.16b
  0x000000011b28d1a4:   cmgt		v21.16b, v18.16b, v21.16b
  0x000000011b28d1a8:   cmeq		v20.16b, v20.16b, v16.16b
  0x000000011b28d1ac:   bic		v21.16b, v21.16b, v22.16b
  0x000000011b28d1b0:   orr		v20.16b, v20.16b, v21.16b
  0x000000011b28d1b4:   str		w1, [x2, #0x30]
  0x000000011b28d1b8:   addv		b21, v20.16b
  0x000000011b28d1bc:   umov		w8, v21.b[0]
  0x000000011b28d1c0:   cmp		w8, wzr
  0x000000011b28d1c4:   b.ne		#0x11b28d40c
  0x000000011b28d1c8:   add		w14, w7, #0x10
  0x000000011b28d1cc:   ldr		x12, [x28, #0x450]
  0x000000011b28d1d0:   str		w14, [x2, #0x14]    ; ImmutableOopMap {c_rarg2=Oop c_rarg5=Oop }
                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                            ; - (reexecute) json.ext.VectorizedEscapeScanner::scan@308 (line 59)
<snip>
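
For readers less used to NEON, the vector sequence above (eor, two cmgt, cmeq, bic, orr) looks like a signed-byte version of the usual escape test. The scalar Java sketch below is one reading of it, under the assumption that the constant registers hold 2, 33, 0 and '\\' (their values are not shown in the excerpt). EscapePredicateSketch is a hypothetical name, not a class in this PR.

```java
final class EscapePredicateSketch {
    /**
     * Per-byte form of the test. Java bytes are signed, so bytes >= 0x80 compare as
     * negative and must be masked out explicitly (the role bic seems to play above).
     */
    static boolean needsEscape(byte b) {
        boolean lowOrQuote = (byte) (b ^ 2) < 33; // for non-negative b: b < 0x20 or b == '"'
        boolean negative = b < 0;                 // bytes >= 0x80 when viewed as unsigned
        boolean backslash = b == '\\';
        return (lowOrQuote && !negative) || backslash;
    }

    public static void main(String[] args) {
        int count = 0;
        for (int u = 0; u < 256; u++) {
            if (needsEscape((byte) u)) {
                count++;
            }
        }
        // Control characters 0x00..0x1F plus '"' and '\\'.
        System.out.println(count + " of 256 byte values need escaping");
    }
}
```

The XOR with 2 maps '"' (0x22) to 0x20 and leaves every control character below 0x20, so a single less-than comparison against 33 covers both cases.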

@headius
Contributor

headius commented Jul 16, 2025

@samyron OMG I look away for a few days and you just go and do it! Bravo!

I'll have a look at these changes soon and see if I can offer any suggestions. This API is still a bit of a moving target, but I think we can work around that with a little Ruby magic here and there.

I will also point the Vector API folks at this PR so they can see what we're doing and provide additional input.

Amazing work!

@headius
Contributor

headius commented Jul 16, 2025

I've posted a thread to the panama-dev list here: https://mail.openjdk.org/pipermail/panama-dev/2025-July/021080.html

@samyron
Contributor Author

samyron commented Jul 28, 2025

I decided to try a different approach after looking at the HotSpot C2 output. Unlike in the C extension, where we mostly control method inlining, HotSpot isn't so easily influenced.

I merged VectorizedStringEncoder which wraps the escape logic in the vectorized scanning. This reduces method calls back to the search code.

Performance of VectorizedStringEncoder

scott@Scotts-MacBook-Air json % ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djson.enableVectorizedEscapeScanner=false -Djson.enableVectorizedStringEncoder=true' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
VectorizedEscapeScanner disabled.
json.ext.VectorizedStringEncoder loaded successfully.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.537k i/100ms
Calculating -------------------------------------
                json     15.382k (± 0.6%) i/s   (65.01 μs/i) -    155.237k in  10.092376s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json    81.000 i/100ms
Calculating -------------------------------------
                json    818.347 (± 0.7%) i/s    (1.22 ms/i) -      8.181k in   9.997474s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json   176.000 i/100ms
Calculating -------------------------------------
                json      1.766k (± 1.9%) i/s  (566.28 μs/i) -     17.776k in  10.070684s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a Java HotSpot(TM) 64-Bit Server VM 21.0.7+8-LTS-245 on 21.0.7+8-LTS-245 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.426k i/100ms
Calculating -------------------------------------
                json     23.958k (± 0.6%) i/s   (41.74 μs/i) -    240.174k in  10.025043s

Additionally, here is a screenshot of VisualVM showing the result of running the activitypub.json benchmark for 30 seconds.

[Image: VisualVM screenshot of the activitypub.json benchmark]

@headius
Contributor

headius commented Jul 28, 2025

@samyron This is interesting progress! I am looking forward to trying it myself now that I'm back in the office.

Yes, HotSpot can be a tricky beast to manipulate. We will want to look at some deeper logging of the JIT and inlining decisions to see whether everything that should be is getting inlined. There's potentially other parts of json unrelated to your changes that are also interfering with inlining (such as the double-dispatching logic to find an appropriate formatter for output text).

Have you tried running on a newer JDK? There's continuous improvements in this area.

@headius
Contributor

headius commented Jul 28, 2025

It's also possible that we are losing too much performance to excessive allocation. I'll try to do some profiling once I get your code up and running.

@headius
Contributor

headius commented Jul 28, 2025

Oh, BTW, I got one response to my email about your work, pointing me to a Java library that has already been attempting to use the vector API to speed up json processing. It may provide some interesting pointers: https://github.com/simdjson/simdjson-java

@samyron
Contributor Author

samyron commented Jul 31, 2025

Have you tried running on a newer JDK? There's continuous improvements in this area.

I have tried running the same benchmarks using JDK 24.

These benchmarks include some WIP changes that aren't reflected in this branch, but I have seen the highest peak performance on activitypub.json using JDK 24. It's not consistent, though: running the benchmarks back-to-back, I see a significant drop in the activitypub.json benchmark. This could be because I'm running on a passively cooled M1 MacBook Air. However, I find it strange that it only seems to affect the activitypub.json benchmark. It's possible that it's just because it's the first benchmark run. I need to do more testing.

Edit: Some quick testing shows changing the order of benchmarks does change the results a bit. I moved the citm_catalog.json benchmark to the first run and it changes the first two benchmarks a bit.

Run 1

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.956k i/100ms
Calculating -------------------------------------
                json     19.496k (± 0.6%) i/s   (51.29 μs/i) -    391.200k in  20.065956s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    84.000 i/100ms
Calculating -------------------------------------
                json    842.971 (± 0.8%) i/s    (1.19 ms/i) -     16.884k in  20.030700s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   184.000 i/100ms
Calculating -------------------------------------
                json      1.810k (± 5.2%) i/s  (552.36 μs/i) -     36.064k in  19.999143s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.441k i/100ms
Calculating -------------------------------------
                json     24.268k (± 1.0%) i/s   (41.21 μs/i) -    485.759k in  20.018208s

Run 2

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.394k i/100ms
Calculating -------------------------------------
                json     13.893k (± 3.1%) i/s   (71.98 μs/i) -    277.406k in  19.993002s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    83.000 i/100ms
Calculating -------------------------------------
                json    844.179 (± 1.1%) i/s    (1.18 ms/i) -     16.932k in  20.059934s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   184.000 i/100ms
Calculating -------------------------------------
                json      1.832k (± 5.7%) i/s  (545.97 μs/i) -     36.432k in  20.014787s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.467k i/100ms
Calculating -------------------------------------
                json     24.605k (± 0.7%) i/s   (40.64 μs/i) -    493.400k in  20.053374s

@headius
Contributor

headius commented Aug 2, 2025

Those numbers are quite a bit better, albeit unpredictable! Order changes likely indicate that there's some polymorphism interfering with optimization, breaking inlining and falling back on slower calls with less optimization. That's the sort of JIT logging I'm hoping to dig into soon.

Once I can focus on this we have some OpenJDK folks interested in seeing the results and helping us tune things.

@samyron
Contributor Author

samyron commented Aug 8, 2025

After reading "How we made JSON.stringify more than twice as fast" with the VisualVM results fresh in my mind, I figured we could try segmenting the output buffer into chunks to completely avoid the ensureBuffer calls in the ByteListDirectOutputStream.

I created two very similar OutputStream classes: SegmentedByteListDirectOutputStream and Segmented2ByteListDirectOutputStream. The former manages a linked list of Segments containing byte[] buffers, and the latter holds a two-dimensional byte[][] array. The idea of growing the capacity by powers of 2 is to limit the number of System.arraycopy calls needed when toByteListDirect is called.
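
To make the segmentation idea concrete, here is a minimal sketch under stated assumptions. It is not either of the two classes named above: SegmentedBufferSketch is a hypothetical name and it keeps its segments in a plain ArrayList. Writes never copy existing data. When a segment fills up, a new one of twice the size is started, so the final assembly needs only one arraycopy per segment.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentedBufferSketch {
    private final List<byte[]> full = new ArrayList<>(); // retired segments, never copied while writing
    private byte[] current;                              // segment currently being filled
    private int pos;                                     // write position inside current
    private int total;                                   // bytes held in retired segments

    SegmentedBufferSketch(int initialCapacity) {
        current = new byte[Math.max(1, initialCapacity)];
    }

    void write(byte[] src, int off, int len) {
        while (len > 0) {
            if (pos == current.length) {
                // Segment is full: retire it and start one twice as large.
                // (A real implementation would also guard against int overflow here.)
                full.add(current);
                total += current.length;
                current = new byte[current.length * 2];
                pos = 0;
            }
            int n = Math.min(len, current.length - pos);
            System.arraycopy(src, off, current, pos, n);
            pos += n;
            off += n;
            len -= n;
        }
    }

    int size() {
        return total + pos;
    }

    /** Assembles the output with one arraycopy per segment. */
    byte[] toByteArray() {
        byte[] out = new byte[size()];
        int at = 0;
        for (byte[] seg : full) {
            System.arraycopy(seg, 0, out, at, seg.length);
            at += seg.length;
        }
        System.arraycopy(current, 0, out, at, pos);
        return out;
    }

    public static void main(String[] args) {
        SegmentedBufferSketch buf = new SegmentedBufferSketch(4);
        byte[] chunk = "{\"key\":\"value\"}".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        for (int i = 0; i < 100; i++) {
            buf.write(chunk, 0, chunk.length);
        }
        System.out.println(buf.size() + " bytes buffered");
    }
}
```

Because segment sizes double, buffering n bytes takes on the order of log n segments, which is what keeps the number of copies at the end small.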

Here is a screenshot of profiling results using Segmented2ByteListDirectOutputStream from VisualVM:

[Image: VisualVM profiling results using Segmented2ByteListDirectOutputStream]

The benchmarks are also much more consistent between invocations.

First run

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.887k i/100ms
Calculating -------------------------------------
                json     18.863k (± 1.8%) i/s   (53.02 μs/i) -    377.400k in  20.015082s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    86.000 i/100ms
Calculating -------------------------------------
                json    849.623 (± 5.1%) i/s    (1.18 ms/i) -     16.942k in  20.024229s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   179.000 i/100ms
Calculating -------------------------------------
                json      1.792k (± 4.5%) i/s  (558.03 μs/i) -     35.800k in  20.042557s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.457k i/100ms
Calculating -------------------------------------
                json     24.571k (± 1.7%) i/s   (40.70 μs/i) -    491.400k in  20.006269s

Second run immediately after the first

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.873k i/100ms
Calculating -------------------------------------
                json     18.933k (± 1.0%) i/s   (52.82 μs/i) -    380.219k in  20.084278s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    83.000 i/100ms
Calculating -------------------------------------
                json    853.193 (± 1.1%) i/s    (1.17 ms/i) -     17.098k in  20.042329s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   181.000 i/100ms
Calculating -------------------------------------
                json      1.809k (± 2.5%) i/s  (552.82 μs/i) -     36.200k in  20.027645s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.442k i/100ms
Calculating -------------------------------------
                json     24.572k (± 0.7%) i/s   (40.70 μs/i) -    493.284k in  20.075724s

Note: The C extension's FBuffer implementation may benefit from segmentation too.

@samyron
Contributor Author

samyron commented Aug 11, 2025

I added a SWAR implementation of basic escape scanning for when the Vector API-based implementation is disabled. Performance is quite good.
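
For context, SWAR ("SIMD within a register") here means testing eight bytes at once inside a single long. The sketch below shows one common way to do that for the escape check. It is an illustration only, not the code in this branch: SwarEscapeSketch and its method names are hypothetical. It assumes bytes >= 0x80 pass through unescaped.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class SwarEscapeSketch {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    /** True if any of the 8 bytes packed in x is < 0x20, '"' or '\\'. Bytes >= 0x80 are never flagged. */
    static boolean anyNeedsEscape(long x) {
        long ascii = ~x & HIGHS;                                // 0x80 in each lane whose byte is < 0x80
        long lowOrQuote = (x ^ (ONES * 0x02)) - (ONES * 0x21);  // lane underflows when (b ^ 2) < 0x21
        long backslash = (x ^ (ONES * 0x5C)) - ONES;            // lane underflows when b == '\\'
        // Borrows can spill into a higher lane, but only after a lane that really matched,
        // so the "any lane" answer stays exact.
        return ((lowOrQuote | backslash) & ascii) != 0;
    }

    /** Index of the first byte in [off, off + len) that needs escaping, or -1 if none does. */
    static int firstEscape(byte[] buf, int off, int len) {
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        int pos = off;
        int end = off + len;
        // Skip clean 8-byte words quickly.
        while (pos + 8 <= end && !anyNeedsEscape(bb.getLong(pos))) {
            pos += 8;
        }
        // Scalar scan of the flagged word, or of the tail shorter than 8 bytes.
        for (; pos < end; pos++) {
            int b = buf[pos] & 0xFF;
            if (b < 0x20 || b == '"' || b == '\\') {
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] s = "hello \"world\"".getBytes(StandardCharsets.US_ASCII);
        System.out.println("first escape at index " + firstEscape(s, 0, s.length));
    }
}
```

The word test only answers "is there anything to escape in these eight bytes"; locating the exact byte is left to the scalar loop, which keeps the fast path branch-light for strings that need no escaping.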

== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.523k i/100ms
Calculating -------------------------------------
                json     15.252k (± 1.3%) i/s   (65.57 μs/i) -    306.123k in  20.075048s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    77.000 i/100ms
Calculating -------------------------------------
                json    767.053 (± 4.0%) i/s    (1.30 ms/i) -     15.323k in  20.014496s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   169.000 i/100ms
Calculating -------------------------------------
                json      1.710k (± 1.1%) i/s  (584.71 μs/i) -     34.307k in  20.061924s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.340k i/100ms
Calculating -------------------------------------
                json     23.132k (± 4.9%) i/s   (43.23 μs/i) -    460.980k in  20.003270s

@headius
Contributor

headius commented Aug 11, 2025

segmenting the output buffer into chunks

What, you mean my super-naïve implementation was not efficient? 😆

This is an excellent improvement! Could you move the stream improvements to a separate PR so it doesn't get tied up with the Vector work? I'd expect we can merge it immediately!

It's on my list to revisit this work this week, now that I'm back from holiday.

I'd also like to feature this work as an example of JRuby's potential on newer JVMs in my upcoming conference talks.

@samyron
Contributor Author

samyron commented Aug 13, 2025

segmenting the output buffer into chunks

What, you mean my super-naïve implementation was not efficient? 😆

This is an excellent improvement! Could you move the stream improvements to a separate PR so it doesn't get tied up with the Vector work? I'd expect we can merge it immediately!

It's on my list to revisit this work this week, now that I'm back from holiday.

I'd also like to feature this work as an example of JRuby's potential on newer JVMs in my upcoming conference talks.

@headius Unfortunately, the segmented output stream on its own didn't have much (or really any) performance impact. The bottleneck is the StringEncoder#encode method, at least with the data I've been benchmarking. However, when I optimize StringEncoder#encode to call a fast-path encodeBasic or the SWAR-based encodeBasicSWAR, it does have a performance impact.

See #835.

@headius
Contributor

headius commented Aug 14, 2025

@samyron Perhaps the segmented version would show more impact with larger output? We may be chasing our tails here, though... ideally if you are generating tens of megabytes of json you're streaming it somewhere and not buffering it. The ByteList form is just to fulfill the API returning a String if you don't provide an output stream.

The SWAR results are excellent though!

@samyron
Contributor Author

samyron commented Aug 18, 2025

@samyron Perhaps the segmented version would show more impact with larger output? We may be chasing our tails here, though... ideally if you are generating tens of megabytes of json you're streaming it somewhere and not buffering it. The ByteList form is just to fulfill the API returning a String if you don't provide an output stream.

The SWAR results are excellent though!

I'm happy to remove the segmented buffer and/or have it disabled by default. It's really up to you. On my M1 Macbook Air it definitely seems to help. It also seems more resilient to changing the order of the benchmarks. We should probably move this conversation over to #835. If that PR is merged, I'm going to rebase this branch so it only has the vectorized string encoder implementation.

@byroot
Member

byroot commented Aug 28, 2025

Do you wish to continue this PR, or is it no longer necessary after #835 ?

@headius
Contributor

headius commented Aug 28, 2025

@byroot The optimizations in #835 are the non-vector parts of this PR pulled out for merging separately. They will work fine on all versions of JRuby and JVM. The changes here relating to the JVM vector API are still in progress.

@samyron
Contributor Author

samyron commented Aug 28, 2025

I will circle back in the next few days to clean this up with #835 having been merged. I believe the only relevant changes are the Rakefile and VectorizedStringEncoder.

This PR will be quite a bit smaller / simpler.

@jatin-bhateja

jatin-bhateja commented Sep 8, 2025

As @headius suggested in an offline email, I'm adding a link to my performance analysis of this patch on x86 targets:
https://mail.openjdk.org/pipermail/panama-dev/2025-August/021124.html

@samyron
Contributor Author

samyron commented Sep 9, 2025

Benchmarks from my M4 Pro:

SWAR

ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
VectorizedStringEncoder disabled.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.599k i/100ms
Calculating -------------------------------------
                json     26.393k (± 3.2%) i/s   (37.89 μs/i) -    527.597k in  20.011866s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json   134.000 i/100ms
Calculating -------------------------------------
                json      1.321k (± 1.2%) i/s  (756.75 μs/i) -     26.532k in  20.081091s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json   264.000 i/100ms
Calculating -------------------------------------
                json      2.619k (± 1.4%) i/s  (381.88 μs/i) -     52.536k in  20.066346s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json     3.528k i/100ms
Calculating -------------------------------------
                json     34.943k (± 1.0%) i/s   (28.62 μs/i) -    702.072k in  20.093948s

Vectorized

ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djruby.json.useVectorizedBasicEncoder=true' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
jruby: warning: unknown property jruby.json.useVectorizedBasicEncoder
json.ext.VectorizedStringEncoder loaded successfully.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.843k i/100ms
Calculating -------------------------------------
                json     28.349k (± 1.1%) i/s   (35.28 μs/i) -    568.600k in  20.060089s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json   130.000 i/100ms
Calculating -------------------------------------
                json      1.288k (± 1.0%) i/s  (776.29 μs/i) -     25.870k in  20.084602s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json   279.000 i/100ms
Calculating -------------------------------------
                json      2.788k (± 0.9%) i/s  (358.70 μs/i) -     55.800k in  20.017077s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 21.0.8+9-LTS on 21.0.8+9-LTS +jit [arm64-darwin]
Warming up --------------------------------------
                json     3.732k i/100ms
Calculating -------------------------------------
                json     36.195k (± 4.3%) i/s   (27.63 μs/i) -    724.008k in  20.068609s

pos += SP.length();
}

ByteBuffer bb = ByteBuffer.wrap(ptrBytes, 0, len);
Contributor Author

If we don't have enough data for another full vector, this falls back to a SWAR implementation.

I did try to refactor the SWAR implementation so that both the SWAR encoder and this fallback code call a common method. Unfortunately, it seemed to affect method inlining and was slower in both the SWAR encoder and the vectorized encoder.
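For readers unfamiliar with the term, here is a minimal sketch of what a SWAR ("SIMD within a register") scan for bytes that need JSON escaping can look like. It is not the code from this PR: the class and method names (`SwarEscapeSketch`, `chunkNeedsEscape`, `firstEscapeIndex`) are made up for illustration, and it assumes the basic escape set (bytes below 0x20, `"` and `\`) with non-ASCII bytes passed through untouched.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
final class SwarEscapeSketch {
    private static final long ONES = 0x0101010101010101L;  // 0x01 in every byte lane
    private static final long HIGHS = 0x8080808080808080L; // 0x80 in every byte lane

    /** True if any of the 8 bytes packed into chunk is below 0x20, '"' or '\\'. */
    public static boolean chunkNeedsEscape(long chunk) {
        long ascii = ~chunk & HIGHS;                     // 0x80 in lanes whose byte is < 0x80
        long control = chunk - ONES * 0x20;              // ASCII lanes: high bit set if byte < 0x20
        long quote = (chunk ^ (ONES * '"')) - ONES;      // ASCII lanes: high bit set if byte == '"'
        long backslash = (chunk ^ (ONES * '\\')) - ONES; // ASCII lanes: high bit set if byte == '\\'
        // Non-ASCII lanes are masked out, so UTF-8 continuation bytes pass through.
        // A borrow can only mis-flag a lane above a genuine hit, so the
        // "does any lane match" answer is still exact.
        return ((control | quote | backslash) & ascii) != 0;
    }

    /** Index of the first byte that needs escaping, or -1 if there is none. */
    public static int firstEscapeIndex(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        // Skip clean input eight bytes at a time.
        while (pos + 8 <= bytes.length && !chunkNeedsEscape(bb.getLong(pos))) {
            pos += 8;
        }
        // Scalar scan of the flagged chunk and of the tail shorter than 8 bytes.
        for (; pos < bytes.length; pos++) {
            int b = bytes[pos] & 0xFF;
            if (b < 0x20 || b == '"' || b == '\\') {
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] input = "say \"hi\"".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstEscapeIndex(input));
    }
}
```

The same shape applies to the fallback described above: once fewer bytes remain than a full vector, the remainder can be consumed in 8-byte words and then byte by byte.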

Class<?> vectorizedStringEncoderClass = StringEncoder.class.getClassLoader().loadClass(VECTORIZED_STRING_ENCODER_CLASS);
Constructor<?> vectorizedStringEncoderConstructor = vectorizedStringEncoderClass.getDeclaredConstructor();
scanner = (StringEncoder) vectorizedStringEncoderConstructor.newInstance();
System.out.println(scanner.getClass().getName() + " loaded successfully.");
Contributor Author

I know I need to remove this print statement but it's helpful for testing to ensure things get loaded correctly.

System.out.println(scanner.getClass().getName() + " loaded successfully.");
} catch (ClassNotFoundException | NoSuchMethodException | InstantiationException | IllegalAccessException | InvocationTargetException e) {
// Fallback to the StringEncoder if we cannot load the VectorizedStringEncoder.
System.err.println("Failed to load VectorizedStringEncoder, falling back to StringEncoder:");
Contributor Author

This should either be removed or a logger should be used.
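One way the logger option could look, as a rough sketch rather than the code in this PR: it uses `java.util.logging` because most of the extension is built for Java 1.8, and logs at `Level.FINE` so nothing is printed by default. `EncoderLoaderSketch`, `Encoder`, `ScalarEncoder` and `FastEncoder` are hypothetical stand-ins, not the real `StringEncoder` classes. Catching `LinkageError` is an extra assumption on my part: it may be useful if the class exists but the `jdk.incubator.vector` module is not available at run time.

```java
import java.lang.reflect.Constructor;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only; all type names here are hypothetical.
final class EncoderLoaderSketch {
    private static final Logger LOG = Logger.getLogger(EncoderLoaderSketch.class.getName());

    /** Stand-in for the encoder type used by the generator. */
    interface Encoder {
        String name();
    }

    /** Stand-in for the default (non-vectorized) encoder. */
    static final class ScalarEncoder implements Encoder {
        public String name() { return "scalar"; }
    }

    /** Stand-in for the optional encoder that is loaded reflectively. */
    static final class FastEncoder implements Encoder {
        public String name() { return "fast"; }
    }

    /** Loads className reflectively; falls back to ScalarEncoder on any failure. */
    public static Encoder load(String className) {
        try {
            Class<?> cls = EncoderLoaderSketch.class.getClassLoader().loadClass(className);
            Constructor<?> ctor = cls.getDeclaredConstructor();
            Encoder encoder = (Encoder) ctor.newInstance();
            LOG.log(Level.FINE, "{0} loaded successfully.", cls.getName());
            return encoder;
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // ReflectiveOperationException covers ClassNotFoundException,
            // NoSuchMethodException, InstantiationException,
            // IllegalAccessException and InvocationTargetException.
            LOG.log(Level.FINE, "Failed to load " + className + ", falling back to the scalar encoder", e);
            return new ScalarEncoder();
        }
    }

    public static void main(String[] args) {
        System.out.println(load(FastEncoder.class.getName()).name());
        System.out.println(load("no.such.VectorizedEncoder").name());
    }
}
```

With this shape the "loaded successfully" message used for testing can stay in the code and be switched on through logging configuration instead of being deleted.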

Contributor Author

samyron commented Sep 9, 2025

Results are a lot closer on my M1 MacBook Air.

Vectorized

json % ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djruby.json.useVectorizedBasicEncoder=true' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::arrayBaseOffset has been called by org.jruby.util.StringSupport (file:/Users/scott/.asdf/installs/ruby/jruby-9.4.12.0/lib/jruby.jar)
WARNING: Please consider reporting this to the maintainers of class org.jruby.util.StringSupport
WARNING: sun.misc.Unsafe::arrayBaseOffset will be removed in a future release
jruby: warning: unknown property jruby.json.useVectorizedBasicEncoder
WARNING: A restricted method in java.lang.System has been called
WARNING: java.lang.System::load has been called by com.kenai.jffi.internal.StubLoader in module org.jruby.dist (file:/Users/scott/.asdf/installs/ruby/jruby-9.4.12.0/lib/jruby.jar)
WARNING: Use --enable-native-access=org.jruby.dist to avoid a warning for callers in this module
WARNING: Restricted methods will be blocked in a future release unless native access is enabled

Ignoring fiddle-1.1.6 because its extensions are not built. Try: gem pristine fiddle --version 1.1.6
Ignoring jruby-launcher-1.1.19-java because its extensions are not built. Try: gem pristine jruby-launcher --version 1.1.19
Ignoring resolv-0.6.0 because its extensions are not built. Try: gem pristine resolv --version 0.6.0
json.ext.VectorizedStringEncoder loaded successfully.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.873k i/100ms
Calculating -------------------------------------
                json     18.676k (± 0.7%) i/s   (53.55 μs/i) -    374.600k in  20.059265s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    88.000 i/100ms
Calculating -------------------------------------
                json    880.837 (± 1.0%) i/s    (1.14 ms/i) -     17.688k in  20.083042s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   189.000 i/100ms
Calculating -------------------------------------
                json      1.895k (± 0.7%) i/s  (527.57 μs/i) -     37.989k in  20.042787s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.576k i/100ms
Calculating -------------------------------------
                json     25.811k (± 1.3%) i/s   (38.74 μs/i) -    517.776k in  20.064383s

SWAR

ONLY=json JAVA_OPTS='--add-modules jdk.incubator.vector -Djruby.json.useVectorizedBasicEncoder=false' ruby -I"lib" benchmark/encoder-realworld.rb
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::arrayBaseOffset has been called by org.jruby.util.StringSupport (file:/Users/scott/.asdf/installs/ruby/jruby-9.4.12.0/lib/jruby.jar)
WARNING: Please consider reporting this to the maintainers of class org.jruby.util.StringSupport
WARNING: sun.misc.Unsafe::arrayBaseOffset will be removed in a future release
jruby: warning: unknown property jruby.json.useVectorizedBasicEncoder
WARNING: A restricted method in java.lang.System has been called
WARNING: java.lang.System::load has been called by com.kenai.jffi.internal.StubLoader in module org.jruby.dist (file:/Users/scott/.asdf/installs/ruby/jruby-9.4.12.0/lib/jruby.jar)
WARNING: Use --enable-native-access=org.jruby.dist to avoid a warning for callers in this module
WARNING: Restricted methods will be blocked in a future release unless native access is enabled

Ignoring fiddle-1.1.6 because its extensions are not built. Try: gem pristine fiddle --version 1.1.6
Ignoring jruby-launcher-1.1.19-java because its extensions are not built. Try: gem pristine jruby-launcher --version 1.1.19
Ignoring resolv-0.6.0 because its extensions are not built. Try: gem pristine resolv --version 0.6.0
VectorizedStringEncoder disabled.
== Encoding activitypub.json (52595 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     1.789k i/100ms
Calculating -------------------------------------
                json     17.896k (± 0.7%) i/s   (55.88 μs/i) -    357.800k in  19.994788s

== Encoding citm_catalog.json (500298 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json    89.000 i/100ms
Calculating -------------------------------------
                json    899.762 (± 0.8%) i/s    (1.11 ms/i) -     18.067k in  20.080997s

== Encoding twitter.json (466906 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json   189.000 i/100ms
Calculating -------------------------------------
                json      1.892k (± 1.5%) i/s  (528.61 μs/i) -     37.989k in  20.086657s

== Encoding ohai.json (20147 bytes)
jruby 9.4.12.0 (3.1.4) 2025-02-11 f4ab75096a OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +jit [arm64-darwin]
Warming up --------------------------------------
                json     2.673k i/100ms
Calculating -------------------------------------
                json     26.595k (± 0.9%) i/s   (37.60 μs/i) -    531.927k in  20.002217s

Contributor Author

samyron commented Oct 6, 2025

@headius would love your thoughts on this. I'm pretty happy with this at this point. There might be a few System.out.println's that need to be removed but otherwise I'd love your feedback.

Contributor

headius commented Oct 6, 2025

@samyron I'm having a look now and trying it out locally. Give me a bit to play with these final patches and we'll sort out what to do about the printlns.

Contributor

headius commented Oct 7, 2025

It looks good to me! I'm going to try another benchmark on a Linux machine, because I only got a small gain on macOS... small enough that I wasn't sure it's actually doing anything.

Contributor

headius commented Oct 8, 2025

I see big gains on the two UTF-8 benchmarks and small gains (or same) on most of the others:

vector=false

== Encoding mixed utf8 (5003001 bytes)
jruby 10.0.3.0-SNAPSHOT (3.4.5) 2025-10-06 ae8cbb2040 OpenJDK 64-Bit Server VM 25+36-Ubuntu-124.04.2 on 25+36-Ubuntu-124.04.2 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json    52.000 i/100ms
Calculating -------------------------------------
                json    525.577 (± 1.9%) i/s    (1.90 ms/i) -      2.652k in   5.047991s

== Encoding mostly utf8 (5001001 bytes)
jruby 10.0.3.0-SNAPSHOT (3.4.5) 2025-10-06 ae8cbb2040 OpenJDK 64-Bit Server VM 25+36-Ubuntu-124.04.2 on 25+36-Ubuntu-124.04.2 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json    52.000 i/100ms
Calculating -------------------------------------
                json    525.870 (± 1.0%) i/s    (1.90 ms/i) -      2.652k in   5.043552s

vector=true

== Encoding mixed utf8 (5003001 bytes)
jruby 10.0.3.0-SNAPSHOT (3.4.5) 2025-10-06 ae8cbb2040 OpenJDK 64-Bit Server VM 25+36-Ubuntu-124.04.2 on 25+36-Ubuntu-124.04.2 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json    80.000 i/100ms
Calculating -------------------------------------
                json    791.729 (± 1.6%) i/s    (1.26 ms/i) -      4.000k in   5.053708s

== Encoding mostly utf8 (5001001 bytes)
jruby 10.0.3.0-SNAPSHOT (3.4.5) 2025-10-06 ae8cbb2040 OpenJDK 64-Bit Server VM 25+36-Ubuntu-124.04.2 on 25+36-Ubuntu-124.04.2 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json    81.000 i/100ms
Calculating -------------------------------------
                json    794.259 (± 0.9%) i/s    (1.26 ms/i) -      3.969k in   4.997479s

I think this is good to merge now and we can continue to refine it from there (including doing some more formal logging to replace the debug printlns). cc @byroot
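To put a number on "big gains", the i/s figures in this comment can be turned into speedup ratios. The snippet below is only a convenience sketch; `SpeedupSketch` and `speedup` are names invented here, and the inputs are copied from the Linux x86_64 runs above.

```java
import java.util.Locale;

// Illustrative sketch only; the class and method names are hypothetical.
final class SpeedupSketch {
    /** Ratio of two iterations-per-second figures; above 1.0 means the candidate is faster. */
    public static double speedup(double baselineIps, double candidateIps) {
        return candidateIps / baselineIps;
    }

    public static void main(String[] args) {
        // Arguments are (vector=false i/s, vector=true i/s) from the runs above.
        System.out.printf(Locale.ROOT, "mixed utf8:  %.2fx%n", speedup(525.577, 791.729));
        System.out.printf(Locale.ROOT, "mostly utf8: %.2fx%n", speedup(525.870, 794.259));
    }
}
```

Both ratios come out at roughly 1.51, i.e. about 50% more iterations per second with the vectorized encoder on these two benchmarks.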

Contributor

headius commented Oct 8, 2025

@byroot Let us know if you want anything further done on this before merging. I have some other PRs that will address minor issues here and improve the auto-configuration of the vector API.

Member

byroot commented Oct 9, 2025

> Let us know if you want anything further done on this before merging.

Just some cleanup, e.g. no println, no commented-out code, and a clean git history (no WIP commits, commit messages that make sense, etc.). Just squashing into a single commit is fine too.
