Refactor: Move spec outside server by SamuelOliveirads · Pull Request #1949 · ikawrakow/ik_llama.cpp

SamuelOliveirads · 2026-06-10T14:45:30Z

The server currently handles a significant amount of the speculative decoding logic, which makes it hard to implement and maintain the same functionality in other tools such as llama-cli or llama-sweep-bench.

The goal here is to move as much of that logic as possible into the speculative and common modules. To check for regressions, I benchmarked this branch against main on several models:

MTP Full (10 runs × 1500 tokens)

Model	Task	Branch	Main	Δ
Qwen 27B n=3	code	66.3 ± 0.3 t/s @ 82.3%	66.4 ± 0.4 t/s @ 83.6%	−0.2%
	extract	59.4 ± 0.2 t/s @ 71.6%	60.4 ± 0.2 t/s @ 74.8%	−1.7%
	story	47.0 ± 0.3 t/s @ 48.8%	44.7 ± 0.3 t/s @ 46.1%	+5.1%
	Overall	57.6 ± 8.1 t/s	57.2 ± 9.3 t/s	+0.7%
Gemma 4 31B n=3	code	63.2 ± 0.3 t/s @ 85.2%	64.1 ± 0.3 t/s @ 85.2%	−1.4%
	extract	61.3 ± 0.4 t/s @ 86.6%	61.8 ± 0.3 t/s @ 86.6%	−0.8%
	story	39.3 ± 0.1 t/s @ 42.0%	39.2 ± 0.1 t/s @ 42.0%	+0.3%
	Overall	54.6 ± 11.0 t/s	55.0 ± 11.4 t/s	−0.7%
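
The Δ column appears to be the relative change of the branch throughput with respect to main; the rows above are consistent with that reading (for example, 47.0 vs 44.7 t/s gives +5.1%). A minimal sketch of that arithmetic, assuming this definition. The helper names `relative_delta_pct` and `round1` are hypothetical and not part of the PR:

```cpp
#include <cassert>
#include <cmath>

// Relative change of the branch throughput with respect to main, in percent.
// Assumed definition: (branch - main) / main * 100.
double relative_delta_pct(double branch_tps, double main_tps) {
    assert(main_tps > 0.0);
    return (branch_tps - main_tps) / main_tps * 100.0;
}

// Round to one decimal place, matching the precision used in the table.
double round1(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

With this definition a negative Δ means the branch is slower than main, and a positive Δ means it is faster.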

Ngram & Draft Smoke (1 run × 1500 tokens)

Model	Method	Branch	Main	Δ
Qwen 27B	ngram-mod	27.3 t/s @ 9.5%	27.5 t/s @ 9.5%	−0.7%
GLM 4.5 Air	draft n=8	17.2 t/s @ 14.5%	17.6 t/s @ 14.5%	−2.3%

ikawrakow · 2026-06-11T11:57:50Z

Generally looks good, but there is still an issue with the main and draft contexts going out of sync.

If I change LOG_DBG to LOG_INF in mtp_update_kv_cache and also print the result of llama_kv_cache_seq_pos_max, here is what I get when running @sayap's query from PR #1894:

[MTP-UPDATE|PROMPT_WARMUP] Updating 1147 tokens for seq_id 0 from pos 0. pos_max = 0
slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 1146, pos_max = 1146, n_tokens = 1147, size = 149.636 MiB, took 23.95 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140017520398336" timestamp=1781178423 id_slot=0 id_task=0 p0=1147
[MTP-UPDATE|PROMPT_WARMUP] Updating 5 tokens for seq_id 0 from pos 1147. pos_max = 1146
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1153. pos_max = 1152
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1158. pos_max = 1156
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1163. pos_max = 1161
[MTP-UPDATE|GEN_ACCEPTED] Updating 4 tokens for seq_id 0 from pos 1168. pos_max = 1166
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1172. pos_max = 1170
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1177. pos_max = 1175
[MTP-UPDATE|GEN_ACCEPTED] Updating 2 tokens for seq_id 0 from pos 1182. pos_max = 1180
[MTP-UPDATE|GEN_ACCEPTED] Updating 1 tokens for seq_id 0 from pos 1184. pos_max = 1182
[MTP-UPDATE|GEN_ACCEPTED] Updating 2 tokens for seq_id 0 from pos 1185. pos_max = 1182
[MTP-UPDATE|GEN_ACCEPTED] Updating 1 tokens for seq_id 0 from pos 1187. pos_max = 1185
...

I.e., at the second accepted batch we get a gap in the position sequence of the MTP context: the update starts at pos 1158 while pos_max is only 1156, and the mismatch persists in the following batches. Not shown here, but if I also log the drafted tokens, I observe that quite often the same token is drafted twice in a row. This leads to the observed unreasonably low acceptance rate. PR #1894 somehow fixes that, but only if the u-batch size is 512 or 256.
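
To spot such gaps without reading the log by eye, one can compare the start position of each update with pos_max + 1. The following is a minimal sketch that assumes the exact line format of the modified LOG_INF output quoted above. The names `MtpUpdate`, `parse_mtp_update` and `gap_before` are hypothetical helpers for illustration, not functions from ik_llama.cpp:

```cpp
#include <cassert>
#include <regex>
#include <string>

// One parsed "[MTP-UPDATE|...]" log line (format assumed from the log above).
struct MtpUpdate {
    std::string phase;  // e.g. "PROMPT_WARMUP" or "GEN_ACCEPTED"
    int n_tokens;       // number of tokens in this update
    int start_pos;      // first position written by this update
    int pos_max;        // pos_max printed on the same line
};

// Parse one log line. Returns false for lines that are not MTP-UPDATE lines.
bool parse_mtp_update(const std::string & line, MtpUpdate & out) {
    static const std::regex re(
        R"(\[MTP-UPDATE\|(\w+)\] Updating (\d+) tokens for seq_id \d+ from pos (\d+)\. pos_max = (\d+))");
    std::smatch m;
    if (!std::regex_search(line, m, re)) {
        return false;
    }
    assert(m.size() == 5);  // whole match plus four capture groups
    out.phase     = m[1].str();
    out.n_tokens  = std::stoi(m[2].str());
    out.start_pos = std::stoi(m[3].str());
    out.pos_max   = std::stoi(m[4].str());
    return true;
}

// Number of positions skipped between pos_max and the first position of the
// update. 0 means the update is contiguous; a positive value is a gap.
int gap_before(const MtpUpdate & u) {
    return u.start_pos - (u.pos_max + 1);
}
```

Applied to the log above, `gap_before` is 0 for the update starting at pos 1153, 1 for the update starting at pos 1158, and 2 for the update starting at pos 1185. The very first line (pos 0 with pos_max = 0) yields a negative value and would need to be treated as a special case.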

Refactor speculative decoding: move logic outside of server

cbea9b1

SamuelOliveirads wants to merge 1 commit into ikawrakow:main from SamuelOliveirads:feat/speculative-runtime-core
