Add testing infrastructure for all compute kernels used in GPULlama3.java #741
base: develop
Conversation
- Implemented RMS normalization, rotary position encoding, and scaled dot-product attention on GPU through TornadoVM.
- Added dedicated methods for matrix-vector operations, key-value caching, and feed-forward processing with SiLU activation.
- Enabled optimized computations for attention and fused feed-forward networks targeting performance improvements.
Pull request overview
This PR adds comprehensive testing infrastructure for GPU-accelerated LLM inference kernels, specifically for the Llama3 transformer model. The tests validate individual compute kernels and their fused combinations to ensure numerical correctness across the transformer pipeline.
Key Changes:
- Adds unit tests for individual transformer operations (RMS normalization, matrix multiplication, RoPE rotation, attention, FFN)
- Adds integration tests for progressively fused kernel pipelines
- Implements sequential reference implementations for numerical verification (a brief sketch follows this list)
- Registers new test classes in the tornado-test script
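For orientation, a sequential reference of this kind for RMS normalization might look as follows. This is a minimal sketch for illustration only; the method name, array types, and epsilon constant are assumptions, not the PR's actual code.

```java
// Hypothetical sequential reference; names and the epsilon value are illustrative only.
static void rmsNormReference(float[] out, float[] x, float[] weight, float eps) {
    // Sum of squares over the input vector.
    float ss = 0.0f;
    for (float v : x) {
        ss += v * v;
    }
    // Normalization factor: 1 / sqrt(mean(x^2) + eps).
    float scale = (float) (1.0 / Math.sqrt(ss / x.length + eps));
    // Scale each element and apply the learned RMS-norm weight.
    for (int i = 0; i < x.length; i++) {
        out[i] = weight[i] * (x[i] * scale);
    }
}
```

A GPU kernel is then validated by comparing its output against such a reference within a small epsilon tolerance.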
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 39 comments.
| File | Description |
|---|---|
| TestTransformerKernelsUnit.java | Individual kernel unit tests with FP32/FP16 variants, covering RMS norm, matmul, RoPE, attention, FFN operations |
| TestTransformerKernelsFused.java | Progressive integration tests building from simple to complete transformer layer execution |
| GPULlama3Kernels.java | GPU kernel implementations for LLM inference including optimized attention, quantization support |
| tornado-test | Registers the two new test classes in the test suite |
     *
     * <p>How to run:</p>
     * <code>
     * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit
Copilot AI · Nov 25, 2025
The package name in the documentation comment does not match the actual package. The documentation shows org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit but the actual package is uk.ac.manchester.tornado.unittests.gpullama.
Suggested change:
-  * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit
+  * tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit
     *
     * <p>How to run:</p>
     * <code>
     * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused
Copilot AI · Nov 25, 2025
The package name in the documentation comment does not match the actual package. The documentation shows org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused but the actual package is uk.ac.manchester.tornado.unittests.gpullama.
Suggested change:
-  * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused
+  * tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused
    if (gid == 0) {
        // Combine partial sums from all workgroups
        float ss = 0.0f;
        for (int i = 1; i <= (size / localMemSize); i++) { // Assuming 8 workgroups
Copilot AI · Nov 25, 2025
The calculation size / localMemSize assumes an exact division, but this may not always be the case. If size is not evenly divisible by localMemSize, some workgroups' partial sums will be ignored. This should use (size + localMemSize - 1) / localMemSize to account for all workgroups, or match the actual number of workgroups launched.
Suggested change:
-  for (int i = 1; i <= (size / localMemSize); i++) { // Assuming 8 workgroups
+  for (int i = 1; i <= ((size + localMemSize - 1) / localMemSize); i++) { // Ensure all workgroups are included
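To make the ceiling division concrete, here is a small illustration with made-up values (1,000 elements and a workgroup size of 128 are assumptions, not the test's actual configuration):

```java
// Hypothetical sizes, chosen only to show the difference between the two expressions.
int size = 1000;
int localMemSize = 128;

int truncated = size / localMemSize;                          // 7 -> the last partial workgroup's sum is dropped
int numWorkGroups = (size + localMemSize - 1) / localMemSize; // 8 -> matches the workgroups actually launched
```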
| assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED); | ||
|
|
||
| // Print detailed comparison for debugging | ||
| System.out.println("Complete transformer layer - sample comparison:"); | ||
| for (int i = 0; i < Math.min(10, DIM); i++) { | ||
| float expected = expectedX.get(i); | ||
| float actual = x.get(i); | ||
| float diff = Math.abs(expected - actual); | ||
| System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff); | ||
| } | ||
| } |
Copilot AI · Nov 25, 2025
Debug print statements should be removed from test code or placed behind a conditional flag. These statements print to stdout during test execution which can clutter test output and make it harder to identify actual failures.
| assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED); | |
| // Print detailed comparison for debugging | |
| System.out.println("Complete transformer layer - sample comparison:"); | |
| for (int i = 0; i < Math.min(10, DIM); i++) { | |
| float expected = expectedX.get(i); | |
| float actual = x.get(i); | |
| float diff = Math.abs(expected - actual); | |
| System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff); | |
| } | |
| } | |
| try { | |
| assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED); | |
| } catch (AssertionError e) { | |
| // Print detailed comparison for debugging only on failure | |
| System.out.println("Complete transformer layer - sample comparison:"); | |
| for (int i = 0; i < Math.min(10, DIM); i++) { | |
| float expected = expectedX.get(i); | |
| float actual = x.get(i); | |
| float diff = Math.abs(expected - actual); | |
| System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff); | |
| } | |
| throw e; | |
| } |
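An alternative to re-printing only on failure would be to gate the diagnostics behind a flag; a sketch along those lines, where the system property name `gpullama.debug` is an assumption:

```java
// Hypothetical flag; enable with -Dgpullama.debug=true when running the tests.
private static final boolean DEBUG = Boolean.getBoolean("gpullama.debug");

// ... inside the test method ...
if (DEBUG) {
    System.out.println("Complete transformer layer - sample comparison:");
    for (int i = 0; i < Math.min(10, DIM); i++) {
        System.out.printf("  x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n",
                i, expectedX.get(i), x.get(i), Math.abs(expectedX.get(i) - x.get(i)));
    }
}
assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED);
```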
| .task("rmsApplyFFN", GPULlama3Kernels::reductionOneBlock2WithLayer, context, xb, x, rmsFfnWeight, tempFFN) | ||
| .task("fusedFFN", GPULlama3Kernels::fusedFeedForwardWithSiLUAndGLUActivation, context, xb, hb, w1, w3, DIM, HIDDEN_DIM, LOCAL_SIZE) | ||
| .task("ffnProj", GPULlama3Kernels::matrixVectorGenericWithResidual, context, hb, x, w2, HIDDEN_DIM, DIM, LOCAL_SIZE) | ||
| .transferToHost(DataTransferMode.EVERY_EXECUTION, x); |
Copilot AI · Nov 25, 2025
Missing @formatter:on comment after the task graph definition. This is inconsistent with the formatting pattern used throughout the rest of the file (see lines 623, 688, 806 for examples).
Suggested change:
-  .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
+  .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
+  // @formatter:on
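For reference, the pattern used elsewhere in the file wraps the whole fluent task-graph definition in formatter markers. A sketch along those lines follows; the task-graph name and the transferToDevice arguments are assumptions, while the task definitions are taken from the snippet above:

```java
// @formatter:off
TaskGraph taskGraph = new TaskGraph("ffnLayer")
        .transferToDevice(DataTransferMode.FIRST_EXECUTION, x, xb, hb, w1, w2, w3, rmsFfnWeight, tempFFN)
        .task("rmsApplyFFN", GPULlama3Kernels::reductionOneBlock2WithLayer, context, xb, x, rmsFfnWeight, tempFFN)
        .task("fusedFFN", GPULlama3Kernels::fusedFeedForwardWithSiLUAndGLUActivation, context, xb, hb, w1, w3, DIM, HIDDEN_DIM, LOCAL_SIZE)
        .task("ffnProj", GPULlama3Kernels::matrixVectorGenericWithResidual, context, hb, x, w2, HIDDEN_DIM, DIM, LOCAL_SIZE)
        .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
// @formatter:on
```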
    float sum3 = matrixVectorRowMajorOptimizedQ8_0(context, localWorkGroupSize, x, w3_quants, w3_scales, n);

    // Thread 0 in each workgroup writes the final result
    if (localId == 0) {
Copilot AI · Nov 25, 2025
Test is always true.
    // STEP 1: Calculate attention scores for all timesteps
    for (int t = 0; t <= pos; t++) {
        int kvHeadIdx = h / kvMul;
        int keyOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
    // STEP 1: Calculate attention scores for all timesteps
    for (int t = 0; t <= pos; t++) {
        int kvHeadIdx = h / kvMul;
        int keyOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
    float weightedSum = 0.0f;
    for (int t = 0; t <= pos; t++) {
        int kvHeadIdx = h / kvMul;
        int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
Suggested change:
-  int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
+  int valueOffset = (int) (loff + ((long) t) * kvDim + ((long) kvHeadIdx) * headSize);
    int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
    weightedSum += wrapAtt.get(headOffset + t) * value_cache.get(valueOffset + i);
Copilot AI · Nov 25, 2025
Potential overflow in int multiplication before it is converted to long by use in a numeric context.
Suggested change:
-  int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
-  weightedSum += wrapAtt.get(headOffset + t) * value_cache.get(valueOffset + i);
+  long valueOffset = loff + ((long) t) * kvDim + ((long) kvHeadIdx) * headSize;
+  weightedSum += wrapAtt.get(headOffset + t) * value_cache.get((int) (valueOffset + i));
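To illustrate the hazard with standalone numbers (the values below are made up for the demonstration and unrelated to the kernel's actual dimensions):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int t = 100_000;      // hypothetical timestep
        int kvDim = 32_768;   // hypothetical KV dimension
        long wrong = t * kvDim;          // multiplied in int, overflows, then widened: -1018167296
        long right = ((long) t) * kvDim; // widened first, multiplied in long:          3276800000
        System.out.println(wrong + " vs " + right);
    }
}
```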
- Reformatted task graph definitions for improved readability.
- Removed unused debug code from fused transformer kernel tests.
stratika left a comment
some comments from my side.
- On macOS some unit-tests are failing:
  uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer - [WHITELISTED]: NO
  uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testFullRMSNormalization - [WHITELISTED]: NO
  uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations - [WHITELISTED]: NO
- Regarding SPIR-V, if the new unit-tests are not expected to run with SPIR-V, we could add the assertNotBackend(TornadoVMBackendType.SPIRV);
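A sketch of how that guard could look at the start of one of the affected tests, assuming the classes extend the usual TornadoVM unit-test base that provides assertNotBackend:

```java
@Test
public void testFusedMultipleIterations() {
    // Skip this test on the SPIR-V backend, where these kernels are not expected to run.
    assertNotBackend(TornadoVMBackendType.SPIRV);
    // ... existing test body ...
}
```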
These are expected to fail on macOS.
Ok, first, why are they expected to fail? Is it due to accuracy? And second, then we may need to white-list them because the regression pipeline may fail.
In
    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations
    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer
Can it be due to a driver issue? My driver is
/rerun all
🚀 Workflow rerun started. Mode:
✅ Workflow rerun success
Description
This PR adds two test classes for testing the computations required for LLM inference.
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?
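Following the run instructions in the class documentation (with the package names corrected per the review comments above), the new tests can be executed with:

    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit
    tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused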