Enable LTO to all targets, and apply gfx1151 specific configurations #88
osimarr wants to merge 1 commit into
Conversation
|
I'm sending this as a draft and a request for comments, because I'm not sure whether these changes can be applied to other targets or to Windows as well; my test environment is limited to Linux / gfx1151.
|
These are the benchmarks with the b8895 preview (IDK why it says build 8893 there, maybe because 8895 is still a preview?):
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
build: 6217b4958 (8893)
|
Enabling this flag also seems to resolve some VRAM retention issues. @superm1 Thoughts on whether this should be enabled for all devices? |
llama-bench shows a 15%-20% gain in token generation on gfx1151 when applying these specific llama.cpp build configs.

These are benchmarks with this patch:

$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 64042 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 64042 MiB

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | pp512 | 617.84 ± 4.64 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | tg128 | 39.22 ± 0.02 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | pp512 | 612.25 ± 5.97 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | tg128 | 39.57 ± 0.03 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | pp512 | 558.94 ± 97.46 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | tg128 | 37.77 ± 0.16 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | pp512 | 618.13 ± 7.40 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | tg128 | 38.13 ± 0.05 |

These are benchmarks before the patch:

$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LD_LIBRARY_PATH=. ./llama-bench -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL -fa 0,1 --mmap 0,1 -ngl 99
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 64042 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 64042 MiB

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | pp512 | 600.81 ± 20.08 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 0 | tg128 | 32.85 ± 0.35 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | pp512 | 605.97 ± 12.16 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 0 | tg128 | 33.27 ± 0.37 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | pp512 | 607.84 ± 9.18 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 0 | 1 | tg128 | 32.47 ± 0.05 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | pp512 | 618.88 ± 13.58 |
| qwen3next 80B.A3B Q5_K - Medium | 55.45 GiB | 79.67 B | ROCm | 99 | 1 | 1 | tg128 | 33.10 ± 0.06 |
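
For anyone who wants to check the quoted gain against the numbers, here is a minimal sketch that recomputes the relative tg128 improvement from the two tables. The values are copied from the tg128 rows; the dictionary layout and the `gain_percent` helper are my own illustration, not part of the patch or of llama-bench.

```python
# tg128 throughput in tokens/s, copied from the tables above.
# Keys are (fa, mmap) as shown in the llama-bench columns.
with_patch = {(0, 0): 39.22, (1, 0): 39.57, (0, 1): 37.77, (1, 1): 38.13}
before_patch = {(0, 0): 32.85, (1, 0): 33.27, (0, 1): 32.47, (1, 1): 33.10}


def gain_percent(new: float, old: float) -> float:
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# One gain figure per (fa, mmap) configuration.
gains = {
    cfg: gain_percent(with_patch[cfg], before_patch[cfg])
    for cfg in with_patch
}

for (fa, mmap), gain in gains.items():
    print(f"fa={fa} mmap={mmap}: {gain:+.1f}%")
```

All four configurations land between 15% and 20%, which matches the range stated above. The pp512 rows are left out because their reported deviations overlap between the two runs.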
|
I made the patch more conservative and enabled just OPENMP, but for all targets |
|
@slojosic-amd @superm1 Thoughts about setting |
|
Second gfx1151 data point, as you asked for in your draft note. Built and ran your exact bench command on a different Strix Halo box: the flag is fine but the
Setup
Results
So the flag doesn't regress anything here, and
What might be different between us
A few guesses I can't narrow down from my side:
What CPU are you running, and do you have any
Full bench logs + the build/bench scripts: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/lemonade-pr-88-validation |
danielholanda
left a comment
Approving, as the data shows that the flag is broadly safe and is the upstream llama.cpp default.
That said, since the benefits here are fairly limited, we may undo this change in the future if any issues arise.