Enable LTO to all targets, and apply gfx1151 specific configurations #88

Open
osimarr wants to merge 1 commit into lemonade-sdk:main from osimarr:main

@osimarr

@osimarr osimarr commented Apr 23, 2026

llama-bench shows a 15-20% gain in token generation (tg128) on gfx1151 when applying these specific llama.cpp build configs.

These are benchmarks with this patch:

$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 64042 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 64042 MiB

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ----- | ---: | -----: | ------- | --: | -: | ---: | ---: | --: |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | pp512 | 617.84 ± 4.64 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | tg128 | 39.22 ± 0.02 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | pp512 | 612.25 ± 5.97 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | tg128 | 39.57 ± 0.03 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | pp512 | 558.94 ± 97.46 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | tg128 | 37.77 ± 0.16 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | pp512 | 618.13 ± 7.40 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | tg128 | 38.13 ± 0.05 |

These are benchmarks before the patch:

$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 64042 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 64042 MiB

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ----- | ---: | -----: | ------- | --: | -: | ---: | ---: | --: |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | pp512 | 600.81 ± 20.08 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | tg128 | 32.85 ± 0.35 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | pp512 | 605.97 ± 12.16 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | tg128 | 33.27 ± 0.37 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | pp512 | 607.84 ± 9.18 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | tg128 | 32.47 ± 0.05 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | pp512 | 618.88 ± 13.58 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | tg128 | 33.10 ± 0.06 |
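
As a quick check of the headline figure, the tg128 gain can be recomputed from the two result sets above. This is only a minimal sketch: the dictionary names and the `percent_gain` helper are made up for illustration, and the values are copied by hand from the tg128 rows, keyed by `(fa, mmap)`.

```python
# tg128 throughput in tokens/s, copied from the benchmark rows above.
# Keys are (fa, mmap) pairs; the names here are illustrative only.
before = {(0, 0): 32.85, (1, 0): 33.27, (0, 1): 32.47, (1, 1): 33.10}
after = {(0, 0): 39.22, (1, 0): 39.57, (0, 1): 37.77, (1, 1): 38.13}


def percent_gain(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# Per-configuration gain of the patched build over the unpatched one.
gains = {cfg: percent_gain(before[cfg], after[cfg]) for cfg in before}

for (fa, mmap), gain in gains.items():
    print(f"fa={fa} mmap={mmap}: {gain:+.1f}%")
```

On these numbers every configuration lands between roughly 15% and 19.5%, which matches the 15-20% range quoted in the description. The pp512 rows are left out because their run-to-run spread is much larger.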

@osimarr

osimarr commented Apr 23, 2026

Author

I'm sending this as a draft and a request for comment, because I'm not sure whether these changes can be applied to other targets or to Windows as well; my test environment is limited to Linux / gfx1151.

@osimarr

osimarr commented Apr 23, 2026

Author

These are the benchmarks with the b8895 preview (I don't know why it reports build 8893 there; maybe because 8895 is still a preview?)

$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 122880 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 122880 MiB
load_backend: loaded ROCm backend from /home/david/llama-rocm-preview/llama-b8893/libggml-hip.so
load_backend: loaded RPC backend from /home/david/llama-rocm-preview/llama-b8893/libggml-rpc.so
load_backend: loaded CPU backend from /home/david/llama-rocm-preview/llama-b8893/libggml-cpu-zen4.so

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ----- | ---: | -----: | ------- | --: | -: | ---: | ---: | --: |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | pp512 | 619.61 ± 2.83 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | tg128 | 39.67 ± 0.02 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | pp512 | 580.36 ± 8.93 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | tg128 | 39.74 ± 0.01 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | pp512 | 613.34 ± 7.54 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | tg128 | 39.65 ± 0.01 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | pp512 | 586.95 ± 7.91 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | tg128 | 39.74 ± 0.00 |

build: 6217b4958 (8893)

@danielholanda

Contributor

Enabling this flag also seems to resolve some VRAM retention issues. @superm1 Thoughts on whether this should be enabled for all devices?

@osimarr osimarr marked this pull request as ready for review April 29, 2026 01:03
@osimarr

osimarr commented Apr 29, 2026

Author

I made the patch more conservative: it now enables only GGML_OPENMP, but for all targets.

@danielholanda

Contributor

@slojosic-amd @superm1 Thoughts about setting -DGGML_OPENMP=ON?

@h34v3nzc0dex

Second gfx1151 data point as you asked for in your draft note. Built and ran your exact bench command on a different Strix Halo box — the flag is fine but the tg128 gain doesn't show up here, only a small pp512 effect.

Setup

  • Radeon 8060S / gfx1151 / Ryzen AI MAX+ 395 / 128 GiB unified (vs your Total VRAM: 64042 MiB — different BIOS UMA)
  • Ubuntu 24.04, ROCm 7.1.0 stable + nightly libhsa-runtime64.so.1 overlay (sidesteps the known 7.1.0 HSA null-ptr bug on gfx1151)
  • Both binaries built from 1acee6bf8 — the exact ggerganov/llama.cpp commit lemonade b1276 ships
  • Identical CMake flags from .github/workflows/build-llamacpp-rocm.yml, only GGML_OPENMP=OFF/ON differs
  • ldd confirms: OFF binary has no OpenMP link, ON binary links libomp.so from /opt/rocm-7.1.0/lib/llvm/lib/. Single differing factor.
  • One Ubuntu-24.04 toolchain workaround applied identically to both (--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13 — 24.04's gcc-14 default lacks the libstdc++ path ROCm 7.1's clang-20 searches; your CI on 22.04 doesn't need it). Identical in both → cancels.
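
The `ldd` check described in the list above could be scripted along these lines. This is a sketch under stated assumptions, not the script used for the validation: the helper names (`openmp_libs`, `ldd`) and the sample `ldd` output are hypothetical, and which file to inspect (the `llama-bench` binary or one of the `libggml-*.so` backends) depends on how the build links OpenMP.

```python
import subprocess

# Common OpenMP runtime library name prefixes (LLVM, GNU, Intel).
OPENMP_RUNTIMES = ("libomp", "libgomp", "libiomp5")


def openmp_libs(ldd_output: str) -> list[str]:
    """Return the OpenMP runtime libraries named in `ldd` output."""
    libs = []
    for line in ldd_output.splitlines():
        fields = line.split()
        # The first field of each ldd line is the library name.
        if fields and fields[0].startswith(OPENMP_RUNTIMES):
            libs.append(fields[0])
    return libs


def ldd(path: str) -> str:
    """Run `ldd` on a binary or shared object (Linux only)."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical ldd output for illustration; addresses are placeholders.
sample_on = (
    "\tlibomp.so => /opt/rocm-7.1.0/lib/llvm/lib/libomp.so (0x00007f0000000000)\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000100000)\n"
)
sample_off = "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000100000)\n"

print(openmp_libs(sample_on))   # OpenMP runtime present
print(openmp_libs(sample_off))  # no OpenMP runtime linked
```

On a real build, `openmp_libs(ldd("./llama-bench"))` (path assumed) would be run against both the OFF and ON binaries to confirm that the OpenMP link is the only difference.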

Results

| fa | mmap | test | OMP=OFF (tok/s) | OMP=ON (tok/s) | Δ |
| -: | ---: | ---: | --------------: | -------------: | -: |
| 0 | 0 | pp512 | 548.50 ± 7.53 | 577.66 ± 4.42 | +5.3% |
| 0 | 0 | tg128 | 35.64 ± 0.13 | 35.84 ± 0.03 | +0.6% |
| 1 | 0 | pp512 | 559.53 ± 5.97 | 565.62 ± 6.89 | +1.1% |
| 1 | 0 | tg128 | 35.85 ± 0.04 | 35.87 ± 0.02 | +0.1% |
| 0 | 1 | pp512 | 507.00 ± 112.23 | 562.67 ± 6.32 | +11.0% (OFF σ=112, junk) |
| 0 | 1 | tg128 | 36.10 ± 0.10 | 36.49 ± 0.02 | +1.1% |
| 1 | 1 | pp512 | 553.21 ± 5.98 | 563.85 ± 8.78 | +1.9% |
| 1 | 1 | tg128 | 36.16 ± 0.03 | 35.94 ± 1.50 | -0.6% |

tg128 mean delta: +0.5% here vs your +17.6%. All four (fa,mmap) tg deltas inside σ. pp512 shows a small consistent +1-5% on the clean rows.

So the flag doesn't regress anything here — and OpenMP=ON is the upstream default anyway — but the +15-20% headline isn't universal across gfx1151 boxes.

What might be different between us

A few guesses I can't narrow down from my side:

  • ROCm version — your CI is on 7.13 nightly, I'm on 7.1.0 stable. The nightly's HIP runtime might dispatch differently and expose a CPU-side bottleneck I don't have.
  • BIOS UMA cap — your 64 GiB pool vs my 128 GiB. A 55 GiB model fits in either, but the allocator behaves differently.
  • CPU core count — OpenMP's win scales with threads. I'm 16c/32t on the Ryzen AI MAX+ 395.

What CPU are you running, and do you have any OMP_NUM_THREADS set in your shell? If yours is also Ryzen AI MAX+ I'd be curious if it's the ROCm version that's doing it.
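
For comparing environments, a small script like the following could report the OpenMP-related settings on each machine. It is a sketch only: the function name `openmp_thread_settings` is hypothetical, and the list of variables is an assumption about what is worth comparing (`OMP_NUM_THREADS`, `OMP_PROC_BIND` and `OMP_PLACES` are standard OpenMP environment variables).

```python
import os

# Standard OpenMP environment variables that influence threading.
OMP_VARS = ("OMP_NUM_THREADS", "OMP_PROC_BIND", "OMP_PLACES")


def openmp_thread_settings(env=None) -> dict:
    """Collect OpenMP-related environment settings and the logical CPU count.

    `env` defaults to the current process environment; unset variables
    are reported as None.
    """
    env = os.environ if env is None else env
    settings = {name: env.get(name) for name in OMP_VARS}
    settings["logical_cpus"] = os.cpu_count()
    return settings


if __name__ == "__main__":
    for name, value in openmp_thread_settings().items():
        print(f"{name}: {value}")
```

Running it in the same shell that launches `llama-bench` on both boxes would show whether a thread-count override or a different core count is in play.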

Full bench logs + the build/bench scripts: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/lemonade-pr-88-validation

@danielholanda danielholanda left a comment

Contributor

Approving, as the data shows that the flag is broadly safe and is already the upstream llama.cpp default.

That said, since the benefits here are fairly limited, we may undo this change in the future if any issues arise.
