Enable LTO to all targets, and apply gfx1151 specific configurations by osimarr · Pull Request #88 · lemonade-sdk/llamacpp-rocm

osimarr · 2026-04-23T03:02:29Z

llama-bench shows a 15%-20% gain in token generation (tg128) on gfx1151 when applying these specific llama.cpp build configs.

These are benchmarks with this patch:
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 64042 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 64042 MiB

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	0	pp512	617.84 ± 4.64
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	0	tg128	39.22 ± 0.02
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	0	pp512	612.25 ± 5.97
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	0	tg128	39.57 ± 0.03
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	1	pp512	558.94 ± 97.46
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	1	tg128	37.77 ± 0.16
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	1	pp512	618.13 ± 7.40
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	1	tg128	38.13 ± 0.05

These are benchmarks before the patch:
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 64042 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 64042 MiB

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	0	pp512	600.81 ± 20.08
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	0	tg128	32.85 ± 0.35
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	0	pp512	605.97 ± 12.16
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	0	tg128	33.27 ± 0.37
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	1	pp512	607.84 ± 9.18
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	1	tg128	32.47 ± 0.05
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	1	pp512	618.88 ± 13.58
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	1	tg128	33.10 ± 0.06
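
The 15%-20% figure can be checked directly against the two tables. The following is a minimal sketch that recomputes the tg128 gain per (fa, mmap) configuration from the reported means; the dictionary layout and the `percent_gain` helper are illustrative names, not part of llama-bench.

```python
# tg128 throughput (tokens/s) copied from the two tables above,
# keyed by (fa, mmap). Only the reported means are used.
before = {(0, 0): 32.85, (1, 0): 33.27, (0, 1): 32.47, (1, 1): 33.10}
after = {(0, 0): 39.22, (1, 0): 39.57, (0, 1): 37.77, (1, 1): 38.13}


def percent_gain(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


# Per-configuration gain of the patched build over the unpatched one.
gains = {cfg: percent_gain(before[cfg], after[cfg]) for cfg in before}

for (fa, mmap), gain in gains.items():
    print(f"fa={fa} mmap={mmap}: {gain:+.1f}%")
```

All four configurations land between 15% and 20% with these numbers.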

osimarr · 2026-04-23T03:04:31Z

I'm sending this as a draft and a request for comments, because I'm not sure whether these changes can be applied to other targets or to Windows as well; my test environment is limited to Linux / gfx1151.

osimarr · 2026-04-23T03:56:14Z

These are the benchmarks with b8895 preview (IDK why it says build 8893 there, maybe because 8895 is still preview?)

$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 122880 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 122880 MiB
load_backend: loaded ROCm backend from /home/david/llama-rocm-preview/llama-b8893/libggml-hip.so
load_backend: loaded RPC backend from /home/david/llama-rocm-preview/llama-b8893/libggml-rpc.so
load_backend: loaded CPU backend from /home/david/llama-rocm-preview/llama-b8893/libggml-cpu-zen4.so

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	0	pp512	619.61 ± 2.83
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	0	tg128	39.67 ± 0.02
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	0	pp512	580.36 ± 8.93
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	0	tg128	39.74 ± 0.01
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	1	pp512	613.34 ± 7.54
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	0	1	tg128	39.65 ± 0.01
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	1	pp512	586.95 ± 7.91
qwen3next 80B.A3B Q5_K - Medium	55.45 GiB	79.67 B	ROCm	99	1	1	tg128	39.74 ± 0.00

build: 6217b4958 (8893)

danielholanda · 2026-04-23T13:37:45Z

Enabling this flag also seems to resolve some VRAM retention issues. @superm1 Thoughts on whether this should be enabled for all devices?


osimarr · 2026-04-29T01:04:15Z

I made the patch more conservative: it now enables only OpenMP (GGML_OPENMP), but for all targets.

danielholanda · 2026-05-14T23:44:57Z

@slojosic-amd @superm1 Thoughts about setting -DGGML_OPENMP=ON?

h34v3nzc0dex · 2026-05-23T11:58:46Z

Second gfx1151 data point as you asked for in your draft note. Built and ran your exact bench command on a different Strix Halo box — the flag is fine but the tg128 gain doesn't show up here, only a small pp512 effect.

Setup

Radeon 8060S / gfx1151 / Ryzen AI MAX+ 395 / 128 GiB unified (vs your Total VRAM: 64042 MiB — different BIOS UMA)
Ubuntu 24.04, ROCm 7.1.0 stable + nightly libhsa-runtime64.so.1 overlay (sidesteps the known 7.1.0 HSA null-ptr bug on gfx1151)
Both binaries built from 1acee6bf8 — the exact ggerganov/llama.cpp commit lemonade b1276 ships
Identical CMake flags from .github/workflows/build-llamacpp-rocm.yml, only GGML_OPENMP=OFF/ON differs
ldd confirms: OFF binary has no OpenMP link, ON binary links libomp.so from /opt/rocm-7.1.0/lib/llvm/lib/. Single differing factor.
One Ubuntu-24.04 toolchain workaround applied identically to both (--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13 — 24.04's gcc-14 default lacks the libstdc++ path ROCm 7.1's clang-20 searches; your CI on 22.04 doesn't need it). Identical in both → cancels.
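
The ldd check from the setup list can be scripted so it is repeatable across both builds. This is a rough sketch, assuming a Linux host with `ldd` on the PATH; the function names are hypothetical, and treating `libomp` and `libgomp` as the OpenMP runtimes to look for is an assumption rather than something the build scripts specify.

```python
import subprocess

# Library name fragments treated as OpenMP runtimes
# (LLVM's libomp and GNU's libgomp). This list is an assumption.
OPENMP_RUNTIMES = ("libomp", "libgomp")


def linked_openmp_libs(ldd_output: str) -> list[str]:
    """Return the lines of ldd output that mention an OpenMP runtime."""
    return [
        line.strip()
        for line in ldd_output.splitlines()
        if any(name in line for name in OPENMP_RUNTIMES)
    ]


def check_binary(path: str) -> list[str]:
    """Run ldd on the given binary and return its OpenMP dependencies."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    )
    return linked_openmp_libs(result.stdout)
```

Calling `check_binary` on the relevant binary or shared library of each build should return an empty list for the OFF build and at least one line for the ON build; which file to inspect depends on how the backends are packaged.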

Results

fa	mmap	test	OMP=OFF (tok/s)	OMP=ON (tok/s)	Δ
0	0	pp512	548.50 ± 7.53	577.66 ± 4.42	+5.3%
0	0	tg128	35.64 ± 0.13	35.84 ± 0.03	+0.6%
1	0	pp512	559.53 ± 5.97	565.62 ± 6.89	+1.1%
1	0	tg128	35.85 ± 0.04	35.87 ± 0.02	+0.1%
0	1	pp512	507.00 ± 112.23	562.67 ± 6.32	+11.0% (OFF σ=112, junk)
0	1	tg128	36.10 ± 0.10	36.49 ± 0.02	+1.1%
1	1	pp512	553.21 ± 5.98	563.85 ± 8.78	+1.9%
1	1	tg128	36.16 ± 0.03	35.94 ± 1.50	-0.6%

tg128 mean delta: about +0.3% here (averaging the four tg128 rows above) vs your +17.6%. All four (fa,mmap) tg deltas are within roughly ±1%. pp512 shows a small consistent +1-5% on the clean rows.
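
The per-row deltas in the table above can be recomputed from the reported means. The sketch below does that and averages the tg128 rows; the tuple layout and the `pct_delta` helper are illustrative. Because it works from the rounded means in the table, small differences from figures computed on the raw logs are possible.

```python
# Rows copied from the results table: (fa, mmap, test, OMP=OFF tok/s, OMP=ON tok/s).
rows = [
    (0, 0, "pp512", 548.50, 577.66),
    (0, 0, "tg128", 35.64, 35.84),
    (1, 0, "pp512", 559.53, 565.62),
    (1, 0, "tg128", 35.85, 35.87),
    (0, 1, "pp512", 507.00, 562.67),
    (0, 1, "tg128", 36.10, 36.49),
    (1, 1, "pp512", 553.21, 563.85),
    (1, 1, "tg128", 36.16, 35.94),
]


def pct_delta(off: float, on: float) -> float:
    """Relative change of the OMP=ON build over the OMP=OFF build, in percent."""
    return (on / off - 1.0) * 100.0


deltas = {(fa, mmap, test): pct_delta(off, on) for fa, mmap, test, off, on in rows}

# Average only the token-generation rows.
tg = [d for (_, _, test), d in deltas.items() if test == "tg128"]
tg_mean = sum(tg) / len(tg)

for key, delta in deltas.items():
    print(key, f"{delta:+.1f}%")
print(f"tg128 mean delta: {tg_mean:+.2f}%")
```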

So the flag doesn't regress anything here — and OpenMP=ON is the upstream default anyway — but the +15-20% headline isn't universal across gfx1151 boxes.

What might be different between us

A few guesses I can't narrow down from my side:

ROCm version — your CI is on 7.13 nightly, I'm on 7.1.0 stable. The nightly's HIP runtime might dispatch differently and expose a CPU-side bottleneck I don't have.
BIOS UMA cap — your 64 GiB pool vs my 128 GiB. A 55 GiB model fits in either, but the allocator behaves differently.
CPU core count — OpenMP's win scales with threads. I'm 16c/32t on the Ryzen AI MAX+ 395.

What CPU are you running, and do you have any OMP_NUM_THREADS set in your shell? If yours is also Ryzen AI MAX+ I'd be curious if it's the ROCm version that's doing it.
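
To make the answers to these questions comparable between machines, a small script can report the logical CPU count together with the OpenMP environment variables. A minimal sketch: `OMP_NUM_THREADS`, `OMP_PROC_BIND` and `OMP_PLACES` are standard OpenMP variables, while the choice to report exactly these three and the `omp_environment` helper are assumptions made for illustration.

```python
import os

# Standard OpenMP environment variables that influence thread count and placement.
OMP_VARS = ("OMP_NUM_THREADS", "OMP_PROC_BIND", "OMP_PLACES")


def omp_environment(env=os.environ) -> dict:
    """Map each OpenMP variable to its value, or None when it is unset."""
    return {name: env.get(name) for name in OMP_VARS}


# os.cpu_count() reports logical CPUs (hardware threads), not physical cores.
print("logical CPUs:", os.cpu_count())
for name, value in omp_environment().items():
    print(f"{name}={value if value is not None else '(unset)'}")
```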

Full bench logs + the build/bench scripts: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/lemonade-pr-88-validation

danielholanda

Approving, as the data shows that the flag is broadly safe and matches the upstream llama.cpp default.

That said, since the benefits here are fairly limited, we may undo this change in the future if any issues arise.

osimarr force-pushed the main branch from d560601 to 1bb6732 Compare April 29, 2026 01:02

osimarr marked this pull request as ready for review April 29, 2026 01:03

danielholanda approved these changes May 25, 2026

osimarr wants to merge 1 commit into lemonade-sdk:main from osimarr:main
