@siladu siladu commented Oct 31, 2025

Changes

Core changes by @ivokub

  1. Memory Pooling for Performance

Introduces sync.Pool objects to reuse allocations and reduce garbage collection pressure:

  • bigIntPool - reuses big.Int allocations for scalar operations
  • g1Pool / g2Pool - reuses elliptic curve point allocations
  • bytes64Pool - reuses 64-byte buffer allocations
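
As a rough illustration of the pooling idea, here is a minimal sketch using only the standard library. The pool names follow the bullets above; the helper functions scalarBitLen and copyRightAligned are hypothetical and are not taken from the PR.

```go
package main

import (
	"fmt"
	"math/big"
	"sync"
)

// bigIntPool hands out reusable *big.Int values (name from the PR
// description; the body is an illustrative assumption).
var bigIntPool = sync.Pool{New: func() any { return new(big.Int) }}

// bytes64Pool hands out reusable 64-byte buffers. Storing a pointer to the
// array avoids an extra allocation when the value is boxed into an interface.
var bytes64Pool = sync.Pool{New: func() any { return new([64]byte) }}

// scalarBitLen is a hypothetical helper: it parses a big-endian scalar into
// a pooled big.Int and returns its bit length.
func scalarBitLen(scalar []byte) int {
	s := bigIntPool.Get().(*big.Int)
	defer bigIntPool.Put(s)
	s.SetBytes(scalar) // overwrites whatever value the pooled object held
	return s.BitLen()
}

// copyRightAligned is a hypothetical helper: it right-aligns src inside a
// pooled 64-byte scratch buffer and copies the result into dst.
// It assumes len(src) <= 64.
func copyRightAligned(dst, src []byte) {
	buf := bytes64Pool.Get().(*[64]byte)
	defer bytes64Pool.Put(buf)
	*buf = [64]byte{} // clear stale contents left by a previous user
	copy(buf[64-len(src):], src)
	copy(dst, buf[:])
}

func main() {
	fmt.Println(scalarBitLen([]byte{0x01, 0x00})) // 9

	out := make([]byte, 64)
	copyRightAligned(out, []byte{0xAB})
	fmt.Printf("%x\n", out[60:]) // 000000ab
}
```

Pooled objects keep their previous contents, so each user has to overwrite or clear them before use, as both helpers do.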
  2. Simplified Error Handling
  • Before: Functions returned error strings passed through buffers between Go and Java
  • After: Functions return integer error codes (0 for success, 1-8 for various errors)
  • Removes overhead of string allocation and copying across JNI boundary
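
A minimal sketch of the integer error-code pattern, assuming a small set of codes. The type name errorCode appears in the new signature in this PR; the constant names and values below are hypothetical, and the real list (codes 1-8) lives in the PR diff.

```go
package main

import "fmt"

// errorCode is the integer status returned across the native boundary.
type errorCode int32

// Hypothetical codes for illustration only; the PR defines its own set.
const (
	errCodeSuccess            errorCode = iota // 0: operation succeeded
	errCodeInvalidInputLength                  // 1: hypothetical
	errCodePointNotOnCurve                     // 2: hypothetical
)

// describe turns a code into a message on the caller's side, so no string
// has to be allocated or copied across the boundary.
func describe(code errorCode) string {
	switch code {
	case errCodeSuccess:
		return "success"
	case errCodeInvalidInputLength:
		return "invalid input length"
	case errCodePointNotOnCurve:
		return "point is not on the curve"
	default:
		return fmt.Sprintf("unknown error code %d", int32(code))
	}
}

func main() {
	fmt.Println(describe(errCodePointNotOnCurve))
}
```

Sharing one prefix across the constants keeps the success value and the error values in a single related set.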
  3. Streamlined JNI Interface

Changes function signatures from:
func eip196altbn128G1Add(input, output, errorBuf *C.char, inputLen C.int,
outputLen, errorLen *C.int) C.int
To:
func eip196altbn128G1Add(input, output *C.char, inputLen C.int) errorCode

  4. Optimized Field Validation
  • Removes manual field checking (checkInFieldEIP196 function)
  • Uses SetBytesCanonical() which performs validation internally
  • Eliminates redundant modulus comparisons
  5. Direct Encoding
  • g1AffineEncode now works directly with point objects using RawBytes()
  • Eliminates intermediate Marshal() allocations
  6. Reduced Buffer Sizes
  • Result buffer size reduced from 128 to 64 bytes (only needs to hold one G1 point)
  • Removes 256-byte error buffer entirely

Results

Before this PR

                               |  Actual cost | Derived Cost |  Iteration time |      Throughput
EcAdd                          |      150 gas |      162 gas |      1,618.3 ns |      98.35 ±1.62 MGps
EcAddMarius                    |      150 gas |      417 gas |      4,173.7 ns |      36.87 ±0.36 MGps
EcAddAmez1                     |      150 gas |      402 gas |      4,019.2 ns |      37.66 ±0.29 MGps
EcAddAmez2                     |      150 gas |      406 gas |      4,055.4 ns |      37.66 ±0.33 MGps
EcAddAmez3                     |      150 gas |      407 gas |      4,074.4 ns |      37.81 ±0.38 MGps
EcAddCase0                     |      150 gas |      419 gas |      4,185.6 ns |      37.71 ±0.37 MGps
EcAddCase1                     |      150 gas |      404 gas |      4,040.7 ns |      37.69 ±0.33 MGps
...
EcAddCase100                   |      150 gas |      158 gas |      1,584.8 ns |     101.14 ±1.51 MGps
EcAddCase106                   |      150 gas |      141 gas |      1,412.5 ns |     110.12 ±1.73 MGps
mul1                           |    6,000 gas |    5,191 gas |     51,909.7 ns |     115.89 ±0.56 MGps
mul2                           |    6,000 gas |    5,148 gas |     51,479.8 ns |     117.04 ±0.67 MGps
2 pairings                     |   79,000 gas |   44,288 gas |    442,882.6 ns |     178.45 ±0.37 MGps
4 pairings                     |  113,000 gas |   62,853 gas |    628,526.7 ns |     179.86 ±0.35 MGps
6 pairings                     |  147,000 gas |   81,424 gas |    814,240.0 ns |     180.58 ±0.28 MGps

This PR

                               |  Actual cost | Derived Cost |  Iteration time |      Throughput
EcAdd                          |      150 gas |       76 gas |        760.4 ns |     212.72 ±2.97 MGps
EcAddMarius                    |      150 gas |      333 gas |      3,327.3 ns |      45.85 ±0.39 MGps
EcAddAmez1                     |      150 gas |      328 gas |      3,275.2 ns |      47.25 ±0.44 MGps
EcAddAmez2                     |      150 gas |      328 gas |      3,277.9 ns |      46.64 ±0.42 MGps
EcAddAmez3                     |      150 gas |      323 gas |      3,225.8 ns |      47.07 ±0.35 MGps
EcAddCase0                     |      150 gas |      322 gas |      3,225.0 ns |      47.51 ±0.36 MGps
EcAddCase1                     |      150 gas |      320 gas |      3,204.4 ns |      47.64 ±0.40 MGps
...
EcAddCase100                   |      150 gas |       71 gas |        709.3 ns |     223.52 ±2.95 MGps
EcAddCase106                   |      150 gas |       62 gas |        617.4 ns |     259.65 ±3.74 MGps
mul1                           |    6,000 gas |    5,235 gas |     52,345.7 ns |     115.08 ±0.65 MGps
mul2                           |    6,000 gas |    5,178 gas |     51,777.6 ns |     116.35 ±0.65 MGps
2 pairings                     |   79,000 gas |   44,050 gas |    440,497.9 ns |     179.42 ±0.37 MGps
4 pairings                     |  113,000 gas |   62,573 gas |    625,732.9 ns |     180.65 ±0.35 MGps
6 pairings                     |  147,000 gas |   81,163 gas |    811,627.4 ns |     181.19 ±0.34 MGps
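
For reading the two tables above, a small sketch that computes the speedup from the reported iteration times. The Derived Cost column appears consistent with 10 ns per gas (i.e. a 100 MGas/s target); that relation is inferred from the numbers in the tables, not taken from the benchmark tool's documentation. The row values are copied from the tables; the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// row holds iteration times (ns) copied from the "Before this PR" and
// "This PR" tables.
type row struct {
	name     string
	beforeNs float64
	afterNs  float64
}

// speedup is the ratio of old to new iteration time (> 1 means faster).
func speedup(r row) float64 { return r.beforeNs / r.afterNs }

// derivedGas assumes 10 ns per gas, which matches the reported column but is
// an inference from the data.
func derivedGas(ns float64) int { return int(math.Round(ns / 10)) }

func main() {
	rows := []row{
		{"EcAdd", 1618.3, 760.4},
		{"EcAddMarius", 4173.7, 3327.3},
		{"mul1", 51909.7, 52345.7},
		{"2 pairings", 442882.6, 440497.9},
	}
	for _, r := range rows {
		fmt.Printf("%-12s speedup %.2fx, derived cost %d -> %d gas\n",
			r.name, speedup(r), derivedGas(r.beforeNs), derivedGas(r.afterNs))
	}
}
```

On these figures EcAdd comes out at roughly 2.1x faster, while mul and pairings are essentially unchanged.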

Benchmark Details

besu-ecadd-warm-exec-invert$ time ./build/install/besu/bin/evmtool benchmark --native --warm-iterations=20000 --exec-iterations=1000 --warm-invert=true altBn128
besu/v25.7-develop-a88105d/linux-x86_64/openjdk-java-21

****************************** Hardware Specs ******************************
* VM Type: m6a.2xlarge
* OS: GNU/Linux Ubuntu 24.04.2 LTS (Noble Numbat) build 6.14.0-1009-aws
* Processor: AMD EPYC 7R13 Processor
* Microarchitecture: Zen 3
* Physical CPU packages: 1
* Physical CPU cores: 4
* Logical CPU cores: 8
* Average Max Frequency per core: 4501 MHz
* Memory Total: 32 GB

Testing

  • passes besu reference tests
  • mainnet sync test
  • gas-benchmarks
  • Fuzzing?

TODO

More testing, e.g. on mainnet, gas-benchmarks

siladu and others added 4 commits October 31, 2025 16:40
Signed-off-by: Simon Dudley <[email protected]>
Co-authored-by: Ivo Kubjas <[email protected]>
Signed-off-by: Simon Dudley <[email protected]>
The pairing function now writes results (0x01 or 0x00) directly to the output buffer and only returns error codes for actual errors, eliminating the previous hack of using an error code to represent a valid pairing result of zero.
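
A minimal sketch of that contract: the boolean pairing result goes into byte 31 of the output buffer, and the return code is reserved for real errors. The function name, the constant values, and the choice to zero the first 31 bytes are assumptions for illustration, not the PR's code.

```go
package main

import "fmt"

type errorCode int32

// Hypothetical codes for illustration only.
const (
	errCodeSuccess             errorCode = 0
	errCodeInvalidOutputLength errorCode = 1
)

// writePairingResult stores the pairing check outcome as a 32-byte
// big-endian word: 0x00...01 for true, 0x00...00 for false.
// A false result is a valid outcome, so it still returns errCodeSuccess.
func writePairingResult(output []byte, ok bool) errorCode {
	if len(output) < 32 {
		return errCodeInvalidOutputLength
	}
	for i := 0; i < 32; i++ {
		output[i] = 0 // assumption: clear the whole word, not just byte 31
	}
	if ok {
		output[31] = 1
	}
	return errCodeSuccess
}

func main() {
	out := make([]byte, 32)
	for i := range out {
		out[i] = 0xFF // sentinel, so we can see that the callee wrote the result
	}
	code := writePairingResult(out, false)
	fmt.Println(code, out[31])
}
```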

Signed-off-by: Simon Dudley <[email protected]>
Signed-off-by: Simon Dudley <[email protected]>
@macfarla macfarla mentioned this pull request Nov 4, 2025
inputBytes.length,
output);

if (errorCode != LibGnarkEIP196.EIP196_ERR_CODE_SUCCESS) {
Contributor

can we call err_code_success return_code_success?

Contributor Author

We could, but it stems from this set of related Go consts, where it's idiomatic to share the same prefix, so we would need to change them all to returnCode, and most of them are indeed error codes: https://github.com/hyperledger/besu-native/pull/301/files#diff-9622b17a1165cbfa1780cbc92d116bcbbcb4136daf03dd3d0aa4f9d77373a2ddR35-R41
I'm leaning towards keeping the name unless you feel strongly about changing them all to returnCode...?

@ivokub ivokub left a comment

I also recommend updating the gnark-crypto dependency to v0.19.2 (most recent). Most concretely, it contains optimizations for scalar multiplication when the scalars are small.

For a lot of use cases it can provide a significant speedup (Consensys/gnark-crypto#703), although the gain will be less evident here because of JNI.

To update:

cd gnark/gnark-jni
go get github.com/consensys/gnark-crypto@v0.19.2
go mod tidy

I built and ran the unit tests locally and they pass. I didn't run evmtool.

Otherwise, the changes look good. I think passing the pairing return value directly is better than my previous approach (passing it through the error code).

assertThat(errorCode).isEqualTo(LibGnarkEIP196.EIP196_ERR_CODE_SUCCESS);
// The key test: byte 31 should have been written by Go code (either 0x00 or 0x01, not 0xFF)
assertThat(output[31]).isNotEqualTo((byte) 0xFF);
assertThat(output[31]).isIn((byte) 0x00, (byte) 0x01);

Should we also check that rest is 0x00?

Contributor

@garyschulte garyschulte left a comment

LGTM, one safety concern highlighted

Comment on lines 59 to 80
- ret = eip196altbn128G1Add(i, output, err, i_len, o_len, err_len);
+ ret = eip196altbn128G1Add(i, output, i_len);
  break;
  case EIP196_MUL_OPERATION_RAW_VALUE:
- ret = eip196altbn128G1Mul(i, output, err, i_len, o_len, err_len);
+ ret = eip196altbn128G1Mul(i, output, i_len);
  break;
  case EIP196_PAIR_OPERATION_RAW_VALUE:
- ret = eip196altbn128Pairing(i, output, err, i_len, o_len, err_len);
+ ret = eip196altbn128Pairing(i, output, i_len);
+ // Result is already written to output buffer by Go
+ // ret is only non-zero for actual errors
Contributor

Removing the string error and error length is 👍. Some test fixtures in the CSV may need to be updated to reflect the new error response.

I want to call out, however, that removing the output length makes the raw JNA -> Go interface unsafe. We are loading this library with JNA, but using JNA direct mapping. Direct mapping means that we don't have a proxy doing marshalling and bounds checking.

Removing the output buffer size and its check may be more expedient, but it opens us up to JVM crashes if somebody in the future makes an unsafe change, like sending an undersized or uninitialized output buffer. If it turns out that this optimization is worth the risk, we should add some big scary comments in the JNI wrapper and/or explicit output parameter types for each function so we cannot accidentally send an undersized output buffer.


I agree - right now we're really hoping here that the i and output have correct lengths and are properly initialized. And considering the visibility is public, then in essence anyone can call it.

The main performance problem was the IntByReference type, which was used to send the actual sizes of the arrays through FFI and had significant overhead. However, I think in this case it would be sufficient to do the length checks inside the code you referred to? Then we would fail early before calling into JNA, avoiding segfaults and a JVM crash.
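
The fail-early idea can be sketched as a checked wrapper around an unchecked entry point. This is written in Go only to keep the examples in one language; the check under discussion sits in the Java wrapper. The names rawAdd, safeAdd and resultBytes are hypothetical, and 64 is the result size given in the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

// resultBytes is the size of one encoded G1 point (per the PR description).
const resultBytes = 64

var errOutputTooSmall = errors.New("output buffer smaller than result size")

// rawAdd stands in for the unchecked native entry point: it always writes
// resultBytes bytes and trusts the caller to have sized the buffer.
// (In Go an undersized slice panics; across FFI it could corrupt memory.)
func rawAdd(input, output []byte) {
	for i := 0; i < resultBytes; i++ {
		if i < len(input) {
			output[i] = input[i]
		} else {
			output[i] = 0
		}
	}
}

// safeAdd validates the buffer before crossing into the unchecked call, so a
// bad buffer is reported as an error instead of crashing the process.
func safeAdd(input, output []byte) error {
	if len(output) < resultBytes {
		return errOutputTooSmall
	}
	rawAdd(input, output)
	return nil
}

func main() {
	fmt.Println(safeAdd(make([]byte, 128), make([]byte, 10)))
	fmt.Println(safeAdd(make([]byte, 128), make([]byte, resultBytes)))
}
```

The check costs one length comparison per call and only protects callers that go through the wrapper; anything that can reach the raw entry point directly is still unguarded.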

@siladu siladu force-pushed the improve-ecadd-perf branch from 14b9a14 to d43a197 Compare November 13, 2025 13:42
siladu commented Nov 13, 2025

Rerun of benchmark with gnark-crypto v0.19.2 bump

                               |  Actual cost | Derived Cost |  Iteration time |      Throughput
EcAdd                          |      150 gas |       72 gas |        717.7 ns |     215.21 ±1.54 MGps
EcAddMarius                    |      150 gas |      324 gas |      3,240.1 ns |      46.92 ±0.30 MGps
EcAddAmez1                     |      150 gas |      319 gas |      3,193.3 ns |      48.09 ±0.31 MGps
EcAddAmez2                     |      150 gas |      313 gas |      3,131.4 ns |      48.17 ±0.21 MGps
EcAddAmez3                     |      150 gas |      315 gas |      3,152.9 ns |      48.09 ±0.26 MGps
EcAddCase0                     |      150 gas |      312 gas |      3,124.0 ns |      48.42 ±0.25 MGps
EcAddCase1                     |      150 gas |      313 gas |      3,127.4 ns |      48.45 ±0.26 MGps
...
EcAddCase100                   |      150 gas |       66 gas |        661.2 ns |     227.69 ±1.20 MGps
EcAddCase106                   |      150 gas |       59 gas |        590.4 ns |     265.09 ±2.25 MGps
mul1                           |    6,000 gas |    4,759 gas |     47,586.2 ns |     126.56 ±0.70 MGps
mul2                           |    6,000 gas |    4,712 gas |     47,121.1 ns |     128.01 ±0.81 MGps
2 pairings                     |   79,000 gas |   43,924 gas |    439,235.3 ns |     179.93 ±0.35 MGps
4 pairings                     |  113,000 gas |   62,278 gas |    622,780.7 ns |     181.51 ±0.33 MGps
6 pairings                     |  147,000 gas |   80,592 gas |    805,917.8 ns |     182.44 ±0.29 MGps

Seems to give a small improvement to EcAdd (~1 MGas/s) and mul, but not pairings. I haven't checked whether the benchmarks include the small scalars that might benefit the most.

Comment on lines +268 to +269
if !g1.IsOnCurve() {
return errCodePointOnCurveCheckFailedEIP196
Contributor

@garyschulte garyschulte Nov 13, 2025

FWIW, if we are trying to eke out additional performance, the isOnCurve() checks have been duplicated by SetBytesCanonical since gnark-crypto 0.17.0. The duplicate checks were kept out of an abundance of caution. See #262 (comment) for context.

It is worth at least testing without the duplicate isOnCurve check to determine whether its impact is negligible enough to keep it for "visibility" reasons within the code.

Contributor Author

Yep, that's on my list but would rather keep this as incremental as possible - would save that for another PR.

@siladu siladu force-pushed the improve-ecadd-perf branch from 98f5b5a to b807d30 Compare November 25, 2025 11:12
siladu commented Nov 25, 2025

New benchmark with the latest code, i.e. with the added length check. There is maybe a very slight reduction in throughput, which probably cancels out the gain from the gnark-crypto upgrade 😁

                               |  Actual cost | Derived Cost |  Iteration time |      Throughput
EcAdd                          |      150 gas |       80 gas |        803.3 ns |     205.76 ±3.36 MGps
EcAddMarius                    |      150 gas |      334 gas |      3,340.0 ns |      45.95 ±0.46 MGps
EcAddAmez1                     |      150 gas |      328 gas |      3,280.6 ns |      47.02 ±0.50 MGps
EcAddAmez2                     |      150 gas |      327 gas |      3,273.8 ns |      46.88 ±0.48 MGps
EcAddAmez3                     |      150 gas |      337 gas |      3,370.1 ns |      46.84 ±0.56 MGps
EcAddCase0                     |      150 gas |      325 gas |      3,249.5 ns |      47.81 ±0.49 MGps
EcAddCase1                     |      150 gas |      322 gas |      3,220.7 ns |      47.77 ±0.45 MGps
...
EcAddCase100                   |      150 gas |       73 gas |        725.5 ns |     220.15 ±3.28 MGps
EcAddCase106                   |      150 gas |       64 gas |        642.8 ns |     251.34 ±3.94 MGps
mul1                           |    6,000 gas |    4,771 gas |     47,708.4 ns |     126.34 ±0.76 MGps
mul2                           |    6,000 gas |    4,690 gas |     46,902.1 ns |     128.48 ±0.72 MGps
2 pairings                     |   79,000 gas |   43,882 gas |    438,815.8 ns |     180.09 ±0.34 MGps
4 pairings                     |  113,000 gas |   62,162 gas |    621,623.3 ns |     181.85 ±0.33 MGps
6 pairings                     |  147,000 gas |   80,518 gas |    805,178.5 ns |     182.61 ±0.27 MGps

Contributor

@garyschulte garyschulte left a comment

see safety comment, otherwise LGTM

Comment on lines 68 to 71
if (output.length < EIP196_PREALLOCATE_FOR_RESULT_BYTES) {
return EIP196_ERR_CODE_INVALID_OUTPUT_LENGTH;
}

Contributor

@garyschulte garyschulte Nov 25, 2025

This is safe for eip196_perform_operation, which is our 'contract'. The static native entrypoints are still public and potentially unsafe. IDK if these need to remain public.

I will approve, and leave it to your discretion to make them private or protected and/or add some javadoc to the public static native entrypoints that incorrect output length may lead to a jvm crash.
