rocm - rebased on top of the current main branch, nix build, changes to the rocm version of the kernel #290

Closed
alantsev wants to merge 202 commits into antirez:rocm from alantsev:rocm

alantsev commented May 30, 2026

This PR (the next iteration of #180):

  • enables the wmma indexer for the rocm path
  • introduces a nix build configuration
  • introduces a gfx1151-specific optimisation

The last change alters the order of the reduction (sum) from a well-defined one (in cuda, from small to large, I assume) to one that is not well defined. To mitigate this, I disabled the fast-math options on the rocm build path.

I will re-enable these options later, after implementing a proper version of the optimised kernel.
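
Why the order of a floating-point reduction matters can be shown with a small host-side sketch in C. This is not the kernel from this PR, and the function names `sum_forward` and `sum_backward` are hypothetical. Float addition is not associative, so the same values summed in a different order can round to a different result.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the values front to back, rounding to float after every addition. */
float sum_forward(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Sum the same values back to front. */
float sum_backward(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = n; i > 0; i--) {
        acc += x[i - 1];
    }
    return acc;
}
```

For the input `{1e8f, -1e8f, 1.0f}`, `sum_forward` gives `1.0f` and `sum_backward` gives `0.0f`: going backward, the `1.0f` is added to `-1e8f` first and is lost because it is smaller than the float spacing near 1e8. This assumes IEEE 754 single precision without excess intermediate precision.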

All tests passed. The evaluation and agent flows feel comparable to the baseline.
I cannot see any obvious problems.

It probably should not be merged upstream until the issues above are resolved (i.e. a deterministic kernel, minimal changes to the vanilla ds4_cuda.cu code, etc.).

However, it increases generation throughput from ~8 t/s to ~11 t/s on gfx1151, so I decided to share it with the community.

mitsuhiko and others added 30 commits May 11, 2026 12:30
Implements the Responses API endpoint that Codex CLI (and other modern
OpenAI tooling) speaks instead of /v1/chat/completions. The wire format
is documented in OpenAI's Responses API; this implementation has been
iterated against the Codex CLI binary's SSE parser shape until no
remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call,
  function_call_output, reasoning, custom_tool_call(_output),
  local_shell_call(_output), web_search_call(_output),
  tool_search_call(_output), image_generation_call(_output),
  compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so
  prior actions survive across turns; rejects unknown item types and
  non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized
  text blocks (input_text/output_text/text/summary_text/
  reasoning_text); rejects non-text modalities (input_image/file/
  audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant
  message so text + tool-call turns render as a single assistant
  block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates
  reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice
  explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created,
  output_item.added/.done, reasoning_summary_part.added/.done,
  reasoning_summary_text.delta/.done, content_part.added/.done,
  output_text.delta/.done, function_call_arguments.delta/.done,
  response.completed.
- Branches the terminal event by finish reason: response.failed for
  errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries
  annotations:[]; function_call output_item.added ships with an empty
  arguments string (full args arrive via function_call_arguments.done
  and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream
  marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model
  was suppressed at the tool marker: the suppressed tail is flushed
  as additional output_text deltas so the streamed message matches
  output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same
client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks.

Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.
This reverts commit 2a7a5f3.

There was no ack from the user. Don't want to take a fix
that is astronautically produced from an unclear error
trace.
Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth.

Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.
Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation.

The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)
Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first.

Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations.

Fixes antirez#127.
Return a 400 error with error type "context_exceeded" when prompt tokens exceed
context size. The response includes both n_prompt_tokens and n_ctx fields so
clients can determine exactly why the request failed and how far over the limit
they went.

Error response format:
  {
    "error": {
      "message": "Prompt tokens (N) exceeds context size (M)",
      "type": "context_exceeded",
      "n_prompt_tokens": N,
      "n_ctx": M
    }
  }
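
A minimal sketch in C of how a body of this shape could be produced and consumed. The helper names `format_context_exceeded` and `tokens_over_limit` are hypothetical and are not taken from the server code; only the field names and message text come from the format above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the error body for a prompt that exceeds the
 * context size. Returns the length snprintf reports; a value >= cap means
 * the output was truncated. */
int format_context_exceeded(char *buf, size_t cap, int n_prompt_tokens, int n_ctx) {
    assert(buf != NULL && cap > 0);
    assert(n_prompt_tokens > n_ctx);
    return snprintf(buf, cap,
        "{\"error\":{"
        "\"message\":\"Prompt tokens (%d) exceeds context size (%d)\","
        "\"type\":\"context_exceeded\","
        "\"n_prompt_tokens\":%d,"
        "\"n_ctx\":%d}}",
        n_prompt_tokens, n_ctx, n_prompt_tokens, n_ctx);
}

/* Client side: how far over the limit the request went. */
int tokens_over_limit(int n_prompt_tokens, int n_ctx) {
    return n_prompt_tokens - n_ctx;
}
```

A client that receives the two numeric fields can compute the overshoot directly, for example to decide how much history to drop before retrying.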
dwarfstar is typoed to drawfstar
@alantsev
Contributor Author

Thanks @fry69, the PR you mentioned does not have the indexer for the rocm platform (unless I misread it), which is essential for long-context runs.

@alantsev
Contributor Author

thanks @fry69, let's keep it the way it is.

user and others added 16 commits May 30, 2026 10:21
Add shared help text across the CLI, server, agent, bench, and eval tools. Expand distributed-mode guidance, clean up endpoint naming, and use a TTY-only 256-color layout with clearer section titles, option arguments, separators, examples, and explanatory text.
```
$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
server:
server: OK
ds4 tests: ok
```

```
$ ./ds4-eval -m ds4flash.gguf --plain --questions 12 --tokens 2048 --temp 0 --seed 1
...

PASSED got 16 expected 16 (159.8s, 1437 tokens)
ds4-eval: 10/12 passed, 2 failed, runtime 00h:27m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201     1661     1862 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149      370      519 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      623      704 70       70       AIME2025/aime2025-01
  4 FAILED        313     2048     2361 A        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272     2048     2320 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146     1325     1471 468      468      AIME2025/aime2025-16
  7 PASSED        156     1303     1459 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127      280      407 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 FAILED        633     2048     2681 26       588      AIME2025/aime2025-02
 10 PASSED        182     1080     1262 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137      232      369 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165     1437     1602 16       16       AIME2025/aime2025-03

```
@alantsev
Contributor Author

  • rebased on top of the current main
  • changed the specific kernel to sum up values from the smallest (roughly, still using warp sum)
  • the "no-fast-math" switch is still necessary
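
The idea of summing from the smallest values can be sketched on the host in C. This is only a serial illustration with hypothetical names (`sum_in_order`, `sum_smallest_first`); the kernel in this PR still uses a warp sum and only roughly follows this order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Absolute value without pulling in libm. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* qsort comparator: ascending by magnitude. */
static int cmp_magnitude(const void *a, const void *b) {
    float fa = abs_f(*(const float *)a);
    float fb = abs_f(*(const float *)b);
    return (fa > fb) - (fa < fb);
}

/* Plain in-order float sum, for comparison. */
float sum_in_order(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Sort the array in place by magnitude, then sum from the smallest value up.
 * Note: this reorders the caller's array and does not handle NaN inputs. */
float sum_smallest_first(float *x, size_t n) {
    assert(x != NULL || n == 0);
    if (n > 1) {
        qsort(x, n, sizeof x[0], cmp_magnitude);
    }
    return sum_in_order(x, n);
}
```

With one value of `1e8f` followed by eight values of `1.0f`, the in-order sum stays at `1e8f` because each `1.0f` is rounded away, while the smallest-first sum accumulates the ones to `8.0f` first and returns `100000008.0f`.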

after this rebase the build is significantly slower than before
the prefill value dropped from 80+ t/s to 60+ t/s
the generation is still 11+ t/s

the run is quite stable now.

the recent eval run (until the first failure)

$ ./ds4-eval --nothink --temp 3 --min-p 0.3
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.314s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: model shape DeepSeek V4 Flash
ds4-eval: context buffers 718.95 MiB (ctx=16777, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=4196)
ds4-eval: 10/92 passed, 1 failed, runtime 00h:08m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201      309      510 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149       35      184 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      268      349 70       70       AIME2025/aime2025-01
  4 PASSED        313      186      499 C        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272      156      428 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146      998     1144 468      468      AIME2025/aime2025-16
  7 PASSED        156      573      729 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127       54      181 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 PASSED        633     1687     2320 588      588      AIME2025/aime2025-02
 10 PASSED        182      407      589 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 FAILED        137       93      230 E        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 STOPPED       165      329      494 -        16       AIME2025/aime2025-03

the benchmark

$ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.314s
ds4: cuda backend initialized for graph diagnostics
ds4-bench: context buffers 1742.43 MiB (ctx=65665, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16418)
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,66.52,128,11.32,52184460
4096,2048,65.09,128,9.81,80373132
6144,2048,64.84,128,9.76,108561804

the test

$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.315s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
local-golden-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s
ds4: cuda backend initialized for graph diagnostics
ds4-test: local golden long_story_4096 top1 ref=4371 cand=4371 top5_overlap=5/5 top20_overlap=17/20 top64_overlap=54/64 top20_max_abs=2.37654
local-golden-vectors: OK
metal-short-prefill:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.347s
ds4: cuda backend initialized for graph diagnostics
metal-short-prefill: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
metal-tensor-equivalence:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence candidate route=auto
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.347s
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-15.2082 cand=-15.2082 abs=0 id=1 ref=20.2563 cand=20.2563 abs=0 id=2 ref=-56.211 cand=-56.211 abs=0 id=3 ref=18.3336 cand=18.3336 abs=0 id=4 ref=26.8736 cand=26.8736 abs=0
ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=-1.58617 cand=-1.58617 abs=0 id=1 ref=22.1528 cand=22.1528 abs=0 id=2 ref=-48.1143 cand=-48.1143 abs=0 id=3 ref=11.2723 cand=11.2723 abs=0 id=4 ref=26.9571 cand=26.9571 abs=0
ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-1.93701 cand=-1.93701 abs=0 id=1 ref=24.0429 cand=24.0429 abs=0 id=2 ref=-43.3039 cand=-43.3039 abs=0 id=3 ref=15.4284 cand=15.4284 abs=0 id=4 ref=18.4026 cand=18.4026 abs=0
ds4-test: Tensor equivalence long_memory_archive top1 ref=32111 cand=32111 top5_overlap=4/5 overlap=18/20 max_rank_delta=5 rms=0.74093 max_abs=4.63129 top20_max_abs=3.4231
ds4-test: Tensor equivalence long_memory_archive largest deltas: id=124155 ref=-30.9718 cand=-26.3405 abs=4.63129 id=107551 ref=9.72954 cand=6.04532 abs=3.68421 id=3736 ref=8.53016 cand=4.85223 abs=3.67793 id=78660 ref=-34.4307 cand=-30.7815 abs=3.64919 id=17661 ref=5.76155 cand=9.4065 abs=3.64495
ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=4/5 overlap=16/20 max_rank_delta=3 rms=0.482428 max_abs=2.78742 top20_max_abs=1.17341
ds4-test: Tensor equivalence long_code_audit largest deltas: id=60846 ref=-7.51363 cand=-4.72621 abs=2.78742 id=127145 ref=1.48636 cand=3.62653 abs=2.14017 id=15707 ref=11.8974 cand=14.0267 abs=2.12925 id=44276 ref=6.6181 cand=8.69801 abs=2.07991 id=41594 ref=-3.83239 cand=-1.77186 abs=2.06053
ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=4/5 min_overlap=16/20 worst_rank_delta=5 worst_rms=0.74093 worst_max_abs=4.63129 worst_top20_max_abs=3.4231
metal-tensor-equivalence: OK
server:
server: OK
ds4 tests: ok


- removed unnecessary permutation in the newly introduced kernel - it did
  not provide the necessary guarantees of a proper summation order anyway
- moved it out of the ds4_cuda.cu file
- introduced envvar to control its execution on rocm
- reverted the build changes back to use fastmath - to avoid masking the
  problem

  (even with fast math disabled there is about a 6% chance of failure for
  the --metal-tensor-equivalence test)

the current benchmark run

```
$ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.454s
ds4: cuda backend initialized for graph diagnostics
ds4-bench: context buffers 1742.43 MiB (ctx=65665, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16418)
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,66.75,128,11.75,52184460
4096,2048,65.24,128,10.11,80373132

```

the current version fails the --logprob-vectors test
(it generates "```C\nreturn" instead of the expected "```c\nreturn")
```
$ ./ds4_test --logprob-vectors
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.315s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_code_completion step 1 selected token mismatch
tests/ds4_test.c:808: assertion failed: false
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: ERR
ds4 tests: 1 failure(s)

```

the evaluation run is the same.

the ds4_cuda.cu code changes compared to the main@upstream branch are minimal
@alantsev
Contributor Author

Simplified the code

  • removed unnecessary permutation in the newly introduced kernel
  • moved it out of the ds4_cuda.cu file
  • introduced envvar to control its execution on rocm
  • reverted the build changes back to use fastmath - to avoid masking the problem

this way the ds4_cuda.cu code changes are minimal compared to main@upstream

the current version with fastmath always fails the --logprob-vectors test
(it generates "C\nreturn" instead of the expected "c\nreturn"),
and sometimes the metal tensor equivalence test as well

$ ./ds4_test --logprob-vectors
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.315s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_code_completion step 1 selected token mismatch
tests/ds4_test.c:808: assertion failed: false
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: ERR
ds4 tests: 1 failure(s)

@alantsev
Copy link
Copy Markdown
Contributor Author

alantsev commented May 31, 2026

note: anyone who decides to use it will need to restore the "-fno-fast-math -fno-finite-math-only -fsigned-zeros -ffp-contract=off" flags; without these the errors accumulate too fast

@alantsev
Copy link
Copy Markdown
Contributor Author

I think I found the source of the error: you need to use precise math functions (as metal does) on the critical paths, such as the expert selection kernels. I will close this PR and submit a proper one.
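
Why a selection step is sensitive to small numeric error can be illustrated with a short C sketch. The function `select_top1` and the score values are made up for illustration and are not the expert selection kernel from this PR. Selection is a discrete decision, so an error far too small to matter in a logit can still change which index wins when two scores nearly tie.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the largest score; ties go to the lowest index. */
size_t select_top1(const float *scores, size_t n) {
    assert(scores != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return best;
}
```

With scores `{0.2500001f, 0.2500002f, 0.1f}` the winner is index 1. If an imprecise computation shifts the first score to `0.2500003f`, a change of about 2e-7, the winner becomes index 0 and every later computation runs through a different expert.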

@alantsev alantsev closed this May 31, 2026