Refactor: Move spec outside server #1949

Open
SamuelOliveirads wants to merge 1 commit into ikawrakow:main from SamuelOliveirads:feat/speculative-runtime-core

@SamuelOliveirads

Collaborator

The server currently contains a significant amount of the speculative decoding logic, which makes it hard to implement and maintain the same functionality in other tools, such as llama-cli or llama-sweep-bench.

The goal here is to migrate as much of that logic as possible into the speculative and common modules. To check that behavior and performance stay consistent with main, I ran benchmarks on several models:

MTP Full (10 runs × 1500 tokens)

| Model | Task | Branch | Main | Δ |
|---|---|---|---|---|
| Qwen 27B n=3 | code | 66.3 ± 0.3 t/s @ 82.3% | 66.4 ± 0.4 t/s @ 83.6% | −0.2% |
| | extract | 59.4 ± 0.2 t/s @ 71.6% | 60.4 ± 0.2 t/s @ 74.8% | −1.7% |
| | story | 47.0 ± 0.3 t/s @ 48.8% | 44.7 ± 0.3 t/s @ 46.1% | +5.1% |
| | Overall | 57.6 ± 8.1 t/s | 57.2 ± 9.3 t/s | +0.7% |
| Gemma 4 31B n=3 | code | 63.2 ± 0.3 t/s @ 85.2% | 64.1 ± 0.3 t/s @ 85.2% | −1.4% |
| | extract | 61.3 ± 0.4 t/s @ 86.6% | 61.8 ± 0.3 t/s @ 86.6% | −0.8% |
| | story | 39.3 ± 0.1 t/s @ 42.0% | 39.2 ± 0.1 t/s @ 42.0% | +0.3% |
| | Overall | 54.6 ± 11.0 t/s | 55.0 ± 11.4 t/s | −0.7% |

Ngram & Draft Smoke (1 run × 1500 tokens)

| Model | Method | Branch | Main | Δ |
|---|---|---|---|---|
| Qwen 27B | ngram-mod | 27.3 t/s @ 9.5% | 27.5 t/s @ 9.5% | +0.7% |
| GLM 4.5 Air | draft n=8 | 17.2 t/s @ 14.5% | 17.6 t/s @ 14.5% | +2.3% |

@ikawrakow

Owner

Generally looks good, but there is still an issue with the main and draft contexts going out of sync.

If I change LOG_DBG to LOG_INF in mtp_update_kv_cache and also print the result of llama_kv_cache_seq_pos_max, here is what I get running the query of @sayap (PR #1894):

[MTP-UPDATE|PROMPT_WARMUP] Updating 1147 tokens for seq_id 0 from pos 0. pos_max = 0
slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 1146, pos_max = 1146, n_tokens = 1147, size = 149.636 MiB, took 23.95 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140017520398336" timestamp=1781178423 id_slot=0 id_task=0 p0=1147
[MTP-UPDATE|PROMPT_WARMUP] Updating 5 tokens for seq_id 0 from pos 1147. pos_max = 1146
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1153. pos_max = 1152
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1158. pos_max = 1156
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1163. pos_max = 1161
[MTP-UPDATE|GEN_ACCEPTED] Updating 4 tokens for seq_id 0 from pos 1168. pos_max = 1166
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1172. pos_max = 1170
[MTP-UPDATE|GEN_ACCEPTED] Updating 5 tokens for seq_id 0 from pos 1177. pos_max = 1175
[MTP-UPDATE|GEN_ACCEPTED] Updating 2 tokens for seq_id 0 from pos 1182. pos_max = 1180
[MTP-UPDATE|GEN_ACCEPTED] Updating 1 tokens for seq_id 0 from pos 1184. pos_max = 1182
[MTP-UPDATE|GEN_ACCEPTED] Updating 2 tokens for seq_id 0 from pos 1185. pos_max = 1182
[MTP-UPDATE|GEN_ACCEPTED] Updating 1 tokens for seq_id 0 from pos 1187. pos_max = 1185
...

I.e., at the second accepted batch we get a gap in the sequence of the MTP context: the update starts at pos 1158 while pos_max is only 1156, so position 1157 is never written. Not shown here, but if I then also log the drafted tokens, I observe that quite often it drafts the same token twice in a row. This leads to the observed unreasonably low acceptance rate. PR #1894 somehow fixes that, but only if the u-batch size is 512 or 256.
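
To make the gap easier to spot in a long log, here is a minimal sketch of a checker for the `[MTP-UPDATE|...]` lines quoted above. It is not part of the PR: the `mtp_update` struct and the `parse_mtp_update` and `mtp_gap` helpers are hypothetical names made up for illustration, and the sketch assumes the log format is exactly the one shown. A contiguous update starts at `pos_max + 1`, so any positive value returned by `mtp_gap` counts the positions that were skipped.

```cpp
#include <cstdio>
#include <string>

// Hypothetical record of one "[MTP-UPDATE|...]" log line (not a type from the PR).
struct mtp_update {
    int n_tokens;   // number of tokens in the update batch
    int pos_start;  // first position the batch is written to
    int pos_max;    // value logged from llama_kv_cache_seq_pos_max before the update
};

// Parse a log line of the form shown above.
// Returns false for lines that are not MTP update lines.
inline bool parse_mtp_update(const std::string & line, mtp_update & out) {
    const auto p = line.find("Updating ");
    if (line.find("[MTP-UPDATE|") == std::string::npos || p == std::string::npos) {
        return false;
    }
    int seq_id = 0; // parsed to consume the field, otherwise unused in this sketch
    return std::sscanf(line.c_str() + p,
                       "Updating %d tokens for seq_id %d from pos %d. pos_max = %d",
                       &out.n_tokens, &seq_id, &out.pos_start, &out.pos_max) == 4;
}

// Number of positions skipped between the end of the cache and the new batch.
// 0 means contiguous, > 0 means a hole in the sequence.
// The very first warm-up line (pos 0, pos_max = 0) yields -1 and should be ignored.
inline int mtp_gap(const mtp_update & u) {
    return u.pos_start - (u.pos_max + 1);
}
```

Applied to the log above, the first `GEN_ACCEPTED` line (pos 1153, pos_max 1152) is contiguous, the second (pos 1158, pos_max 1156) skips one position, and the line starting at pos 1185 with pos_max 1182 skips two.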
