Conversation

@mikepapadim
Member

Description

This PR adds two test classes covering the computation required for LLM inference:

  1. as individual (per-unit) kernels
  2. as fused kernels

Backends tested

Mark the backends affected by this PR.

  • OpenCL
  • PTX
  • SPIRV

OS tested

Mark the OS where this PR is tested.

  • Linux
  • macOS
  • Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

  • Yes
  • No

How to test the new patch?

make
tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit
tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused

- Implemented RMS normalization, rotary position encoding, and scaled dot-product attention on GPU through TornadoVM.
- Added dedicated methods for matrix-vector operations, key-value caching, and feed-forward processing with SiLU activation.
- Enabled optimized computations for attention and fused feed-forward networks, targeting performance improvements (a minimal sequential sketch of these operations follows below).
Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive testing infrastructure for GPU-accelerated LLM inference kernels, specifically for the Llama3 transformer model. The tests validate individual compute kernels and their fused combinations to ensure numerical correctness across the transformer pipeline.

Key Changes:

  • Adds unit tests for individual transformer operations (RMS normalization, matrix multiplication, RoPE rotation, attention, FFN)
  • Adds integration tests for progressively fused kernel pipelines
  • Implements sequential reference implementations for numerical verification
  • Registers new test classes in the tornado-test script

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 39 comments.

File descriptions:

  • TestTransformerKernelsUnit.java — individual kernel unit tests with FP32/FP16 variants, covering RMS norm, matmul, RoPE, attention, and FFN operations
  • TestTransformerKernelsFused.java — progressive integration tests building from simple to complete transformer layer execution
  • GPULlama3Kernels.java — GPU kernel implementations for LLM inference, including optimized attention and quantization support
  • tornado-test — registers the two new test classes in the test suite


*
* <p>How to run:</p>
* <code>
* tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit

Copilot AI Nov 25, 2025

The package name in the documentation comment does not match the actual package. The documentation shows org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit but the actual package is uk.ac.manchester.tornado.unittests.gpullama.

Suggested change
- * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsUnit
+ * tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit

*
* <p>How to run:</p>
* <code>
* tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused

Copilot AI Nov 25, 2025

The package name in the documentation comment does not match the actual package. The documentation shows org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused but the actual package is uk.ac.manchester.tornado.unittests.gpullama.

Suggested change
- * tornado-test -V org.beehive.gpullama3.tornadovm.kernels.TestTransformerKernelsFused
+ * tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused

if (gid == 0) {
    // Combine partial sums from all workgroups
    float ss = 0.0f;
    for (int i = 1; i <= (size / localMemSize); i++) { // Assuming 8 workgroups

Copilot AI Nov 25, 2025

The calculation size / localMemSize assumes an exact division, but this may not always be the case. If size is not evenly divisible by localMemSize, some workgroups' partial sums will be ignored. This should use (size + localMemSize - 1) / localMemSize to account for all workgroups, or match the actual number of workgroups launched.

Suggested change
- for (int i = 1; i <= (size / localMemSize); i++) { // Assuming 8 workgroups
+ for (int i = 1; i <= ((size + localMemSize - 1) / localMemSize); i++) { // Ensure all workgroups are included

Comment on lines 696 to 706
assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED);

// Print detailed comparison for debugging
System.out.println("Complete transformer layer - sample comparison:");
for (int i = 0; i < Math.min(10, DIM); i++) {
float expected = expectedX.get(i);
float actual = x.get(i);
float diff = Math.abs(expected - actual);
System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff);
}
}

Copilot AI Nov 25, 2025

Debug print statements should be removed from test code or placed behind a conditional flag. These statements print to stdout during test execution which can clutter test output and make it harder to identify actual failures.

Suggested change
-     assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED);
-
-     // Print detailed comparison for debugging
-     System.out.println("Complete transformer layer - sample comparison:");
-     for (int i = 0; i < Math.min(10, DIM); i++) {
-         float expected = expectedX.get(i);
-         float actual = x.get(i);
-         float diff = Math.abs(expected - actual);
-         System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff);
-     }
- }
+     try {
+         assertArrayEquals("Stage 8: Complete transformer layer output", expectedX, x, EPSILON_ACCUMULATED);
+     } catch (AssertionError e) {
+         // Print detailed comparison for debugging only on failure
+         System.out.println("Complete transformer layer - sample comparison:");
+         for (int i = 0; i < Math.min(10, DIM); i++) {
+             float expected = expectedX.get(i);
+             float actual = x.get(i);
+             float diff = Math.abs(expected - actual);
+             System.out.printf(" x[%d]: expected=%.6f, actual=%.6f, diff=%.6f%n", i, expected, actual, diff);
+         }
+         throw e;
+     }
+ }

.task("rmsApplyFFN", GPULlama3Kernels::reductionOneBlock2WithLayer, context, xb, x, rmsFfnWeight, tempFFN)
.task("fusedFFN", GPULlama3Kernels::fusedFeedForwardWithSiLUAndGLUActivation, context, xb, hb, w1, w3, DIM, HIDDEN_DIM, LOCAL_SIZE)
.task("ffnProj", GPULlama3Kernels::matrixVectorGenericWithResidual, context, hb, x, w2, HIDDEN_DIM, DIM, LOCAL_SIZE)
.transferToHost(DataTransferMode.EVERY_EXECUTION, x);

Copilot AI Nov 25, 2025

Missing @formatter:on comment after the task graph definition. This is inconsistent with the formatting pattern used throughout the rest of the file (see lines 623, 688, 806 for examples).

Suggested change
- .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
+ .transferToHost(DataTransferMode.EVERY_EXECUTION, x);
+ // @formatter:on

float sum3 = matrixVectorRowMajorOptimizedQ8_0(context, localWorkGroupSize, x, w3_quants, w3_scales, n);

// Thread 0 in each workgroup writes the final result
if (localId == 0) {

Copilot AI Nov 25, 2025

Test is always true.

// STEP 1: Calculate attention scores for all timesteps
for (int t = 0; t <= pos; t++) {
    int kvHeadIdx = h / kvMul;
    int keyOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);

Copilot AI Nov 25, 2025

Potential overflow in int multiplication before it is converted to long by use in a numeric context.

// STEP 1: Calculate attention scores for all timesteps
for (int t = 0; t <= pos; t++) {
    int kvHeadIdx = h / kvMul;
    int keyOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);

Copilot AI Nov 25, 2025

Potential overflow in int multiplication before it is converted to long by use in a numeric context.

float weightedSum = 0.0f;
for (int t = 0; t <= pos; t++) {
    int kvHeadIdx = h / kvMul;
    int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);

Copilot AI Nov 25, 2025

Potential overflow in int multiplication before it is converted to long by use in a numeric context.

Suggested change
- int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
+ int valueOffset = (int) (loff + ((long) t) * kvDim + ((long) kvHeadIdx) * headSize);

Comment on lines +340 to +341
int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
weightedSum += wrapAtt.get(headOffset + t) * value_cache.get(valueOffset + i);

Copilot AI Nov 25, 2025

Potential overflow in int multiplication before it is converted to long by use in a numeric context.

Suggested change
- int valueOffset = (int) (loff + t * kvDim + kvHeadIdx * headSize);
- weightedSum += wrapAtt.get(headOffset + t) * value_cache.get(valueOffset + i);
+ long valueOffset = loff + ((long) t) * kvDim + ((long) kvHeadIdx) * headSize;
+ weightedSum += wrapAtt.get(headOffset + t) * value_cache.get((int) (valueOffset + i));

- Reformatted task graph definitions for improved readability.
- Removed unused debug code from fused transformer kernel tests.
@mikepapadim requested a review from stratika November 28, 2025 10:31
Collaborator

@stratika left a comment

Some comments from my side:

  1. on macOS some unit-tests are failing:
    uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer - [WHITELISTED]: NO
    uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testFullRMSNormalization - [WHITELISTED]: NO
    uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations - [WHITELISTED]: NO

  2. regarding SPIR-V, if the new unit-tests are not expected to run with SPIR-V, we could add the assertNotBackend(TornadoVMBackendType.SPIRV); (a sketch of this follows below)

@mikepapadim
Member Author

> Some comments from my side:
>
>   1. on macOS some unit-tests are failing:
>      uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer - [WHITELISTED]: NO
>      uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testFullRMSNormalization - [WHITELISTED]: NO
>      uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations - [WHITELISTED]: NO
>   2. regarding SPIR-V, if the new unit-tests are not expected to run with SPIR-V, we could add the assertNotBackend(TornadoVMBackendType.SPIRV);

These are expected to fail on macOS.

@stratika
Collaborator

> > Some comments from my side:
> >
> >   1. on macOS some unit-tests are failing:
> >      uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer - [WHITELISTED]: NO
> >      uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testFullRMSNormalization - [WHITELISTED]: NO
> >      uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations - [WHITELISTED]: NO
> >   2. regarding SPIR-V, if the new unit-tests are not expected to run with SPIR-V, we could add the assertNotBackend(TornadoVMBackendType.SPIRV);
>
> These are expected to fail on macOS.

OK, first: why are they expected to fail? Is it due to accuracy? And second: then we may need to whitelist them, because the regression pipeline may fail.

@stratika
Collaborator

With the PTX backend on my system, I also got these tests failing with wrong results (the kernels are built and run):

tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsFused#testFusedMultipleIterations
tornado-test -V uk.ac.manchester.tornado.unittests.gpullama.TestTransformerKernelsUnit#testReductionOneBlockWithLayer

Could it be due to a driver issue? My driver is 535.274.02 and CUDA 12.2.

@mikepapadim
Member Author

/rerun all

@github-actions

🚀 Workflow rerun started

Mode: all
Triggered by: @mikepapadim

View Actions

@github-actions

Workflow rerun success

View Actions
