forked from openxla/xla
Rocm jaxlib v0.4.35 qa matmulpass #128
Open
zoranjovanovic-ns wants to merge 38 commits into rocm-jaxlib-v0.4.35 from rocm-jaxlib-v0.4.35-qa-matmulpass
+1,183
−10,697
Conversation
Passing amdgpu targets to the crosstool wrapper, which calls hipcc, can restrict the generated kernels to a specific set of supported amdgpu architectures.
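For context, XLA's actual crosstool wrapper is a Python script; the C++ sketch below only illustrates the flag-forwarding idea. `--offload-arch` is real hipcc/clang syntax, while `BuildHipccArgs` and the target list are hypothetical:

```cpp
// Illustrative sketch: forward one --offload-arch flag per requested
// architecture so hipcc generates code only for that set of GPUs.
#include <string>
#include <vector>

std::vector<std::string> BuildHipccArgs(
    const std::vector<std::string>& compile_args,
    const std::vector<std::string>& amdgpu_targets) {
  std::vector<std::string> args = {"hipcc"};
  for (const std::string& target : amdgpu_targets) {
    args.push_back("--offload-arch=" + target);
  }
  args.insert(args.end(), compile_args.begin(), compile_args.end());
  return args;
}

// e.g. BuildHipccArgs({"-c", "kernel.cc"}, {"gfx90a", "gfx942"}) yields
//   hipcc --offload-arch=gfx90a --offload-arch=gfx942 -c kernel.cc
```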
Launch dimensions should be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate the checks in [parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171).
[ROCm] Fix kernel launch dimension
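A minimal sketch of that invariant, using stand-in types rather than XLA's actual `LaunchDimensions` class:

```cpp
// Launch dimensions must look like ((block.x, 1, 1), (thread.x, thread.y, 1)):
// grid y/z and block z stay 1, mirroring the checks linked above.
#include <cassert>

struct Dim3 { long long x, y, z; };

struct LaunchDimensions {
  Dim3 block_counts;             // grid: only x may exceed 1
  Dim3 thread_counts_per_block;  // block: x and y may vary, z must be 1
};

void CheckLaunchDimensions(const LaunchDimensions& dims) {
  assert(dims.block_counts.y == 1 && dims.block_counts.z == 1);
  assert(dims.thread_counts_per_block.z == 1);
}

int main() {
  // A conforming shape: ((160, 1, 1), (256, 4, 1)).
  CheckLaunchDimensions(LaunchDimensions{{160, 1, 1}, {256, 4, 1}});
  return 0;
}
```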
ir_emitter/elemental_ir_emitter clean-up will follow. PiperOrigin-RevId: 691766033
[ROCm] Fix //xla/tests:complex_unary_op_test and //xla/service/gpu/tests:gpu_input_fusible_slice_test. Imported from GitHub PR openxla#19484. Copybara import of the project: -- 0d30738 by Dragan Mladjenovic <[email protected]>: [ROCm] Fix //xla/tests:complex_unary_op_test and //xla/service/gpu/tests:gpu_input_fusible_slice_test. Merging this change closes openxla#19484. COPYBARA_INTEGRATE_REVIEW=openxla#19484 from ROCm:mlir_tests_new 0d30738 PiperOrigin-RevId: 698374588
Imported from GitHub PR openxla#19426. After the change to the test inputs in openxla@b10653f, the "too many blocks" exception is no longer triggered (the shape is not big enough). Given the low importance of the test, it was decided to disable it. Copybara import of the project: -- ee36ca0 by Milica Makevic <[email protected]>: Disable gpu_too_many_blocks_test for rocm. Merging this change closes openxla#19426. COPYBARA_INTEGRATE_REVIEW=openxla#19426 from ROCm:disable_too_many_blocks_test ee36ca0 PiperOrigin-RevId: 697974812
…thm. Now that we have all the pieces of the puzzle for the X3 algorithm, we can easily add its equivalent for X6. PiperOrigin-RevId: 688294267
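For intuition, here is my reading of the bf16x3 idea behind BF16_BF16_F32_X3 (a sketch, not XLA's implementation): split each f32 operand into a bf16 high part plus a bf16 residual, then accumulate three bf16 products in f32. X6 extends this with a third limb and more cross terms:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncate an f32 to bf16 by dropping the low 16 mantissa bits
// (round-toward-zero, used here for simplicity).
float ToBf16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFF0000u;
  std::memcpy(&x, &bits, sizeof(bits));
  return x;
}

// Approximate an f32 product with three bf16 x bf16 products,
// dropping the smallest term a_lo * b_lo.
float MulX3(float a, float b) {
  float a_hi = ToBf16(a), a_lo = ToBf16(a - a_hi);
  float b_hi = ToBf16(b), b_lo = ToBf16(b - b_hi);
  return a_hi * b_hi + a_hi * b_lo + a_lo * b_hi;
}

int main() {
  float a = 1.2345678f, b = 3.1415927f;
  std::printf("exact f32: %.9g\n", a * b);
  std::printf("bf16 only: %.9g\n", ToBf16(a) * ToBf16(b));
  std::printf("bf16 x3:   %.9g\n", MulX3(a, b));
}
```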
Imported from GitHub PR openxla#19342. Triton is currently disabled on ROCm. Skipping the following subtests in `dot_algorithms_test`:
- TritonAlgorithmTest.Algorithm_BF16_BF16_F32_X3
- TritonAlgorithmTest.Algorithm_BF16_BF16_F32_X6
- TritonAlgorithmTest.Algorithm_TF32_TF32_F32
- TritonAlgorithmTest.Algorithm_TF32_TF32_F32_X3
- TritonAlgorithmTest.Algorithm_BF16_BF16_F32
Copybara import of the project: -- 32bd775 by Milica Makevic <[email protected]>: Disable unsupported Triton subtests. Merging this change closes openxla#19342. COPYBARA_INTEGRATE_REVIEW=openxla#19342 from ROCm:disable_triton_tests 32bd775 PiperOrigin-RevId: 696740956
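The PR summary does not show the skip mechanism; a common pattern for conditionally disabling such subtests is an early GTEST_SKIP(), sketched here with a hypothetical IsRocm() helper (XLA's tests use their own backend queries):

```cpp
#include <gtest/gtest.h>

// Hypothetical stand-in for a real backend query; hard-wired to true
// purely for illustration.
bool IsRocm() { return true; }

TEST(TritonAlgorithmTest, Algorithm_BF16_BF16_F32_X3) {
  if (IsRocm()) {
    GTEST_SKIP() << "Triton is currently disabled on ROCm.";
  }
  // ... would exercise the BF16_BF16_F32_X3 dot algorithm here ...
}
// Link against gtest_main to run.
```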
Rocm jaxlib v0.4.35 qa misc backport
Enable Triton Auto-tuning in XLA
…nSupportedExecutesCorrectlyForDot
Rocm jaxlib v0.4.35 qa triton cleanup
…nels Add NCCL_MAX_NCHANNELS env variable to multi-GPU tests
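NCCL_MAX_NCHANNELS is a standard NCCL environment variable that caps the number of channels a communicator may use. A sketch of wiring it into a test harness (the value 4 is an illustrative choice, not necessarily what the commit sets):

```cpp
#include <cstdlib>

// Cap NCCL channel count before the test spawns communicators; must be set
// before NCCL initializes.
void ConfigureNcclForTests() {
  setenv("NCCL_MAX_NCHANNELS", "4", /*overwrite=*/1);
}
```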
Avoid lazy init of Blas handles, fix for non-canonical dots
Fixed an issue with capturing a local variable in a lambda.
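The PR summary does not include the offending code; the classic shape of this bug class, sketched for illustration:

```cpp
#include <functional>

std::function<int()> MakeCallback() {
  int local = 42;
  // Bug: capturing `local` by reference; the lambda outlives the stack
  // frame and would read a dangling reference.
  //   return [&local] { return local; };

  // Fix: capture by value so the lambda owns a copy.
  return [local] { return local; };
}
```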
R0.4.35 fix test scripts
…ests-2 Fix Triton-related tests
Add gfx1101 support to XLA
Respect the HIP runtime constraint that the max work size must not exceed INT_MAX
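A sketch of the clamping the commit title implies (stand-in function name; the actual XLA call site is not shown in the PR):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// The HIP runtime takes int-sized work sizes, so clamp a 64-bit request
// to INT_MAX before passing it down.
int64_t ClampWorkSize(int64_t requested) {
  const int64_t kHipMax = std::numeric_limits<int>::max();
  return std::min(requested, kHipMax);
}
```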
This change fixes the flaky GPU compiler test that used to run on the ROCm CI pipeline gate. The Triton pipeline was wrongly using the TritonGPUAccelerateMatmul pass, which supports CUDA only. ROCm has a different pass, which is now used in the ROCm pipeline. (by Alexandros Theodoridis)
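Schematically, the fix amounts to selecting the matmul-acceleration pass by platform instead of unconditionally adding the CUDA-only one. The commented factory names below follow Triton's layout at the time but vary across versions, so treat them as assumptions rather than exact API:

```cpp
#include <string>

#include "mlir/Pass/PassManager.h"

// Sketch only: branch on the target platform when building the Triton
// pass pipeline.
void AddAccelerateMatmulPass(mlir::OpPassManager& pm, bool is_rocm,
                             const std::string& arch_or_cc) {
  if (is_rocm) {
    // ROCm path, e.g. (assumed name):
    // pm.addPass(mlir::createTritonAMDGPUAccelerateMatmulPass(arch_or_cc));
  } else {
    // Previously unconditional, CUDA-only path, e.g. (assumed name):
    // pm.addPass(mlir::triton::gpu::createTritonGPUAccelerateMatmulPass(cc));
  }
}
```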
i-chaochen approved these changes on Mar 11, 2025