Add testing infrastructure for all compute kernels used in GPULlama3.java #741
base: develop
Conversation
- Implemented RMS normalization, rotary position encoding, and scaled dot-product attention on GPU through TornadoVM.
- Added dedicated methods for matrix-vector operations, key-value caching, and feed-forward processing with SiLU activation.
- Enabled optimized computations for attention and fused feed-forward networks targeting performance improvements.
Pull request overview
This PR adds comprehensive testing infrastructure for GPU-accelerated LLM inference kernels, specifically for the Llama3 transformer model. The tests validate individual compute kernels and their fused combinations to ensure numerical correctness across the transformer pipeline.
Key Changes:
- Adds unit tests for individual transformer operations (RMS normalization, matrix multiplication, RoPE rotation, attention, FFN)
- Adds integration tests for progressively fused kernel pipelines
- Implements sequential reference implementations for numerical verification (a brief sketch follows this list)
- Registers new test classes in the tornado-test script
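For orientation, a sequential reference of this kind for RMS normalization might look as follows. This is a minimal sketch for illustration only; the method name, array types, and epsilon constant are assumptions, not the PR's actual code.

```java
// Hypothetical sequential reference; names and the epsilon value are illustrative only.
static void rmsNormReference(float[] out, float[] x, float[] weight, float eps) {
    // Sum of squares over the input vector.
    float ss = 0.0f;
    for (float v : x) {
        ss += v * v;
    }
    // Normalization factor: 1 / sqrt(mean(x^2) + eps).
    float scale = (float) (1.0 / Math.sqrt(ss / x.length + eps));
    // Scale each element and apply the learned RMS-norm weight.
    for (int i = 0; i < x.length; i++) {
        out[i] = weight[i] * (x[i] * scale);
    }
}
```

A GPU kernel is then validated by comparing its output against such a reference within a small epsilon tolerance.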
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 39 comments.
| File | Description |
|---|---|
| TestTransformerKernelsUnit.java | Individual kernel unit tests with FP32/FP16 variants, covering RMS norm, matmul, RoPE, attention, FFN operations |
| TestTransformerKernelsFused.java | Progressive integration tests building from simple to complete transformer layer execution |
| GPULlama3Kernels.java | GPU kernel implementations for LLM inference including optimized attention, quantization support |
| tornado-test | Registers the two new test classes in the test suite |
     *
     * <p>How to run:</p>
     * <code>
     * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit
Copilot AI · Nov 25, 2025
The package name in the documentation comment does not match the actual package. The documentation shows org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit but the actual package is uk.ac.manchester.tornado.unittests.gpullama.
Suggested change:
-  * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit
+  * tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit
     *
     * <p>How to run:</p>
     * <code>
     * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused
Copilot AI · Nov 25, 2025
The package name in the documentation comment does not match the actual package. The documentation shows org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused but the actual package is uk.ac.manchester.tornado.unittests.gpullama.
Suggested change:
-  * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused
+  * tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused
    if (gid == 0) {
        // Combine partial sums from all workgroups
        float ss = 0.0f;
        for (int i = 1; i <= (size / localMemSize); i++) { // Assuming 8 workgroups
Copilot AI · Nov 25, 2025
The calculation size / localMemSize assumes an exact division, but this may not always be the case. If size is not evenly divisible by localMemSize, some workgroups' partial sums will be ignored. This should use (size + localMemSize - 1) / localMemSize to account for all workgroups, or match the actual number of workgroups launched.
Suggested change:
-  for (int i = 1; i <= (size / localMemSize); i++) { // Assuming 8 workgroups
+  for (int i = 1; i <= ((size + localMemSize - 1) / localMemSize); i++) { // Ensure all workgroups are included
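To make the ceiling division concrete, here is a small illustration with made-up values (1,000 elements and a workgroup size of 128 are assumptions, not the test's actual configuration):

```java
// Hypothetical sizes, chosen only to show the difference between the two expressions.
int size = 1000;
int localMemSize = 128;

int truncated = size / localMemSize;                          // 7 -> the last partial workgroup's sum is dropped
int numWorkGroups = (size + localMemSize - 1) / localMemSize; // 8 -> matches the workgroups actually launched
```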
| assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED); | ||
|
|
||
| // Print detailed comparison for debugging | ||
| System.out.println("Complete transformer layer - sample comparison:"); | ||
| for (int i = 0; i < Math.min(10, DIM); i++) { | ||
| float expected = expectedX.get(i); | ||
| float actual = x.get(i); | ||
| float diff = Math.abs(expected - actual); | ||
| System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff); | ||
| } | ||
| } |
Copilot AI · Nov 25, 2025
Debug print statements should be removed from test code or placed behind a conditional flag. These statements print to stdout during test execution which can clutter test output and make it harder to identify actual failures.
| assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED); | |
| // Print detailed comparison for debugging | |
| System.out.println("Complete transformer layer - sample comparison:"); | |
| for (int i = 0; i < Math.min(10, DIM); i++) { | |
| float expected = expectedX.get(i); | |
| float actual = x.get(i); | |
| float diff = Math.abs(expected - actual); | |
| System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff); | |
| } | |
| } | |
| try { | |
| assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED); | |
| } catch (AssertionError e) { | |
| // Print detailed comparison for debugging only on failure | |
| System.out.println("Complete transformer layer - sample comparison:"); | |
| for (int i = 0; i < Math.min(10, DIM); i++) { | |
| float expected = expectedX.get(i); | |
| float actual = x.get(i); | |
| float diff = Math.abs(expected - actual); | |
| System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff); | |
| } | |
| throw e; | |
| } |
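An alternative to re-printing only on failure would be to gate the diagnostics behind a flag; a sketch along those lines, where the system property name `gpullama.debug` is an assumption:

```java
// Hypothetical flag; enable with -Dgpullama.debug=true when running the tests.
private static final boolean DEBUG = Boolean.getBoolean("gpullama.debug");

// ... inside the test method ...
if (DEBUG) {
    System.out.println("Complete transformer layer - sample comparison:");
    for (int i = 0; i < Math.min(10, DIM); i++) {
        System.out.printf("  x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n",
                i, expectedX.get(i), x.get(i), Math.abs(expectedX.get(i) - x.get(i)));
    }
}
assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED);
```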
| .task("rmsApplyFFN", GPULlama3Kernels::reductionOneBlock2WithLayer, context, xb, x, rmsFfnWeight, tempFFN) | ||
| .task("fusedFFN", GPULlama3Kernels::fusedFeedForwardWithSiLUAndGLUActivation, context, xb, hb, w1, w3, DIM, HIDDEN_DIM, LOCAL_SIZE) | ||
| .task("ffnProj", GPULlama3Kernels::matrixVectorGenericWithResidual, context, hb, x, w2, HIDDEN_DIM, DIM, LOCAL_SIZE) | ||
| .transferToHost(DataTransferMode.EVERY_EXECUTION, x); |
Copilot AI · Nov 25, 2025
Missing @formatter:on comment after the task graph definition. This is inconsistent with the formatting pattern used throughout the rest of the file (see lines 623, 688, 806 for examples).
Suggested change:
-  .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
+  .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
+  // @formatter:on
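For reference, the pattern used elsewhere in the file wraps the whole fluent task-graph definition in formatter markers. A sketch along those lines follows; the task-graph name and the transferToDevice arguments are assumptions, while the task definitions are taken from the snippet above:

```java
// @formatter:off
TaskGraph taskGraph = new TaskGraph("ffnLayer")
        .transferToDevice(DataTransferMode.FIRST_EXECUTION, x, xb, hb, w1, w2, w3, rmsFfnWeight, tempFFN)
        .task("rmsApplyFFN", GPULlama3Kernels::reductionOneBlock2WithLayer, context, xb, x, rmsFfnWeight, tempFFN)
        .task("fusedFFN", GPULlama3Kernels::fusedFeedForwardWithSiLUAndGLUActivation, context, xb, hb, w1, w3, DIM, HIDDEN_DIM, LOCAL_SIZE)
        .task("ffnProj", GPULlama3Kernels::matrixVectorGenericWithResidual, context, hb, x, w2, HIDDEN_DIM, DIM, LOCAL_SIZE)
        .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
// @formatter:on
```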
    float sum3 = matrixVectorRowMajorOptimizedQ8_0(context, localWorkGroupSize, x, w3_quants, w3_scales, n);

    // Thread 0 in each workgroup writes the final result
    if (localId == 0) {
Copilot AI · Nov 25, 2025
Test is always true.
    // STEP 1: Calculate attention scores for all timesteps
    for (int t = 0; t <= pos; t++) {
        int kvHeadIdx = h / kvMul;
        int keyOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
    // STEP 1: Calculate attention scores for all timesteps
    for (int t = 0; t <= pos; t++) {
        int kvHeadIdx = h / kvMul;
        int keyOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
    float weightedSum = 0.0f;
    for (int t = 0; t <= pos; t++) {
        int kvHeadIdx = h / kvMul;
        int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
Suggested change:
-  int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
+  int valueOffset = (int) (loff + ((long) t) * kvDim + ((long) kvHeadIdx) * headSize);
    int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
    weightedSum += wrapAtt.get(headOffset + t) * value_cache.get(valueOffset + i);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
Suggested change:
-  int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
-  weightedSum += wrapAtt.get(headOffset + t) * value_cache.get(valueOffset + i);
+  long valueOffset = loff + ((long) t) * kvDim + ((long) kvHeadIdx) * headSize;
+  weightedSum += wrapAtt.get(headOffset + t) * value_cache.get((int) (valueOffset + i));
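To illustrate the hazard with standalone numbers (the values below are made up for the demonstration and unrelated to the kernel's actual dimensions):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int t = 100_000;      // hypothetical timestep
        int kvDim = 32_768;   // hypothetical KV dimension
        long wrong = t * kvDim;          // multiplied in int, overflows, then widened: -1018167296
        long right = ((long) t) * kvDim; // widened first, multiplied in long:          3276800000
        System.out.println(wrong + " vs " + right);
    }
}
```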
- Reformatted task graph definitions for improved readability.
- Removed unused debug code from fused transformer kernel tests.
stratika left a comment
some comments from my side.
- On macOS some unit-tests are failing:
  uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer - [WHITELISTED]: NO
  uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testFullRMSNormalization - [WHITELISTED]: NO
  uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations - [WHITELISTED]: NO
- Regarding SPIR-V, if the new unit-tests are not expected to run with SPIR-V, we could add the assertNotBackend(TornadoVMBackendType.SPIRV);
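A sketch of how that guard could look at the start of one of the affected tests, assuming the classes extend the usual TornadoVM unit-test base that provides assertNotBackend:

```java
@Test
public void testFusedMultipleIterations() {
    // Skip this test on the SPIR-V backend, where these kernels are not expected to run.
    assertNotBackend(TornadoVMBackendType.SPIRV);
    // ... existing test body ...
}
```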
These are expected to fail on macOS.
Ok, first, why are they expected to fail? Is it due to accuracy? And second, then we may need to white-list them because the regression pipeline may fail.
In
    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations
    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer
Can it be due to a driver issue? My driver is
/rerun all
🚀 Workflow rerun started. Mode:
✅ Workflow rerun success
Description
This PR adds two test classes for testing the computations required for LLM inference.
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?
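Following the run instructions in the class documentation (with the package names corrected per the review comments above), the new tests can be executed with:

    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit
    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused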