A GGUF editor, which can be used to duplicate or delete layers in Qwen3.5 / Qwen Coder Next or whatever but may not run in 1 shot. by FNsi · Pull Request #1746 · ikawrakow/ik_llama.cpp

FNsi · 2026-05-06T09:35:18Z

I have read the contributing guidelines
Self-reported review complexity:
- [x] Low
- [ ] Medium
- [ ] High

I believe it currently works for Q8_0, Q6_0, IQ4KS, IQ4KSS, f16, f32, and IQ2KS tensors. Other quant types may not be supported yet, but the current code can easily be extended to cover them (the differences are related to row size).
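To illustrate what "row size related" could mean in practice, here is a minimal sketch of a per-type row-size calculation. It is not taken from gguf_layer_editor.py: the `BLOCK_LAYOUT` table and `row_size_bytes` function are hypothetical names, and only F32, F16, and Q8_0 are listed because their block layouts are simple. Other types would need their own entries, and types that carry extra per-row data may need more than this formula.

```python
# Hypothetical lookup table: type name -> (elements per block, bytes per block).
# Only types with a simple, well-known layout are listed; this is an
# illustration, not the table used by the script in this PR.
BLOCK_LAYOUT = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q8_0": (32, 34),  # 32 int8 weights plus one fp16 scale per block
}


def row_size_bytes(type_name: str, n_per_row: int) -> int:
    """Return how many bytes one tensor row occupies for a block-based type."""
    if type_name not in BLOCK_LAYOUT:
        raise ValueError(f"unsupported quant type: {type_name}")
    block_elems, block_bytes = BLOCK_LAYOUT[type_name]
    if n_per_row % block_elems != 0:
        raise ValueError(
            f"row length {n_per_row} is not a multiple of block size {block_elems}"
        )
    return n_per_row // block_elems * block_bytes


print(row_size_bytes("Q8_0", 4096))  # 128 blocks * 34 bytes = 4352
print(row_size_bytes("F16", 4096))   # 4096 * 2 bytes = 8192
```

Extending support to another quant type would then mostly mean adding its block layout and checking that whole rows are copied.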

AI usage: This script was made with Kimi k2.6.

The text below was written by Kimi.

Usage

Duplicate Layers

Copy layers 7, 8, 9, 10 and insert them after layer 10:

python gguf_layer_editor.py input.gguf output.gguf --duplicate 7 8 9 10

Multiple duplication ranges in one shot:

python gguf_layer_editor.py input.gguf output.gguf \
  --duplicate 7 8 9 10 \
  --duplicate 15 16 17 18 \
  --duplicate 30 31 32 33
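The effect of `--duplicate` on layer order can be sketched as follows. This is an assumption about the semantics described above (each group is inserted right after its last layer, and all indices refer to the original numbering); `apply_duplications` is a hypothetical helper, not a function from the script.

```python
from typing import Dict, List, Sequence


def apply_duplications(n_layers: int, dup_ranges: Sequence[Sequence[int]]) -> List[int]:
    """Return the new layer order as a list of ORIGINAL layer indices.

    Each group in dup_ranges is copied and inserted right after the
    highest-numbered layer of that group.
    """
    insert_after: Dict[int, List[int]] = {}
    for group in dup_ranges:
        if not group:
            continue
        for idx in group:
            if not 0 <= idx < n_layers:
                raise ValueError(f"layer {idx} out of range 0..{n_layers - 1}")
        insert_after.setdefault(max(group), []).extend(group)

    order: List[int] = []
    for idx in range(n_layers):
        order.append(idx)                         # the original layer
        order.extend(insert_after.get(idx, []))   # copies placed after it
    return order


print(apply_duplications(12, [[7, 8, 9, 10]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 7, 8, 9, 10, 11]
```

With a full-attention interval of 4 (an assumed value), the copies of layers 7, 8, 9, 10 land at positions 11, 12, 13, 14, so the copied full-attention layer 7 again sits at a full-attention position.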

Delete Layers

Remove layers 4, 5, 6, 7:

python gguf_layer_editor.py input.gguf output.gguf --delete 4 5 6 7

Note: For hybrid architectures, deletions must preserve the recurrent/full-attention pattern. The script validates this and aborts with a clear error if the pattern would break. Generally, delete multiples of 4 consecutive layers.
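A minimal sketch of such a check, assuming the layer-type rule from the Architecture Notes and a `full_attention_interval` of 4 (an assumed value; the real one comes from the model metadata). `check_deletion` is a hypothetical name and the script's actual validation may differ.

```python
from typing import Iterable, List


def is_full_attention(layer_idx: int, interval: int) -> bool:
    """Layer-type rule: (layer_idx + 1) % interval == 0 means full attention."""
    return (layer_idx + 1) % interval == 0


def check_deletion(n_layers: int, delete: Iterable[int], interval: int) -> List[int]:
    """Return the surviving original indices, or raise if the pattern breaks.

    After deletion every surviving layer shifts to a new position; its type
    (recurrent or full attention) must match what that position expects.
    """
    removed = set(delete)
    survivors = [i for i in range(n_layers) if i not in removed]
    for new_idx, old_idx in enumerate(survivors):
        if is_full_attention(old_idx, interval) != is_full_attention(new_idx, interval):
            raise ValueError(
                f"layer {old_idx} would move to position {new_idx}, "
                "which expects a different layer type"
            )
    return survivors


print(check_deletion(12, [4, 5, 6, 7], interval=4))
# [0, 1, 2, 3, 8, 9, 10, 11]
```

Deleting only layers 4, 5, 6 in the same setup raises an error, because full-attention layer 7 would move to position 4, which expects a recurrent layer.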

Combined Operations

Delete some layers and duplicate others:

python gguf_layer_editor.py input.gguf output.gguf \
  --delete 4 5 6 7 \
  --duplicate 30 31 32 33

Verbose Mode

Add --verbose for detailed logging:

python gguf_layer_editor.py input.gguf output.gguf --duplicate 7 8 9 10 --verbose

Architecture Notes

These models alternate between two layer types:

Recurrent layers (linear attention): Have attn_qkv and ssm_* tensors
Full-attention layers: Have attn_q, attn_k, attn_v tensors

The pattern is determined by (layer_idx + 1) % full_attention_interval:

== 0: full-attention layer
!= 0: recurrent layer

When duplicating or deleting, the script ensures layers end up at positions with matching types. Incompatible layers are automatically skipped with a warning.
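The two ways of telling layer types apart (by position and by tensor names) can be sketched like this. The `blk.<N>.` tensor-name prefix follows the usual GGUF naming convention; `expected_kind` and `kind_from_tensors` are hypothetical helpers, not functions from the script.

```python
from typing import Iterable


def expected_kind(layer_idx: int, full_attention_interval: int) -> str:
    """Layer type implied by position alone."""
    if (layer_idx + 1) % full_attention_interval == 0:
        return "full"
    return "recurrent"


def kind_from_tensors(tensor_names: Iterable[str], layer_idx: int) -> str:
    """Layer type implied by which tensors exist for this layer."""
    prefix = f"blk.{layer_idx}."
    # Keep only the component right after the prefix, e.g. "attn_q".
    parts = {
        name[len(prefix):].split(".")[0]
        for name in tensor_names
        if name.startswith(prefix)
    }
    if "attn_qkv" in parts or any(p.startswith("ssm_") for p in parts):
        return "recurrent"
    if {"attn_q", "attn_k", "attn_v"} <= parts:
        return "full"
    return "unknown"


names = [
    "blk.2.attn_qkv.weight", "blk.2.ssm_a",
    "blk.3.attn_q.weight", "blk.3.attn_k.weight", "blk.3.attn_v.weight",
]
for idx in (2, 3):
    print(idx, kind_from_tensors(names, idx), expected_kind(idx, 4))
```

A layer would be compatible with a target position when `kind_from_tensors` for the source layer equals `expected_kind` for the destination index.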

Basically, I think we can build on this and, after some editing, add an MTP layer to an existing GGUF file.

FNsi · 2026-05-06T09:42:43Z

And... as someone who mostly uses GitHub on mobile...

Feel free to edit anything without asking.

ikawrakow · 2026-05-06T10:13:12Z

Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?

FNsi · 2026-05-06T10:35:02Z

> Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?

Well, I like that idea, but Kimi is not on my phone, so...

If a model gets stubborn due to overtraining, deleting some earlier layers makes it more thoughtful;

If we think a model's reasoning is not strong enough, simply duplicating middle or later layers could help;

Evidence? As far as I remember, there were some tests related to rys repetition that supported the duplication idea;

Or maybe we don't need that much justification; it's just a new toy to play with. 😁

usrlocalben · 2026-05-06T14:18:29Z

> Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?

Try this: Modern LLM Hacking and hints of a Universal Language

tl;dr: repeating certain layers in Qwen-28B makes it smarter. With some analysis, one can choose an optimal set of layers to repeat. With some help from the engine, one could repeat those layers without additional VRAM.

Ph0rk0z · 2026-05-06T14:50:27Z

Doubling the middle layers is actually a viable way of improving models, as papers and other stuff say.. But this is the wrong way to do it. Using more vram is not the way to go: turboderp-org/exllamav3#174

ikawrakow · 2026-05-06T14:56:08Z

> Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?
>
> Try this: Modern LLM Hacking and hints of a Universal Language
>
> tl;dr: repeating certain layers in Qwen-28B makes it smarter. With some analysis, one can choose an optimal set of layers to repeat. With some help from the engine, one could repeat those layers without additional VRAM.

And the other day I saw a post claiming that TurboQuant was the best thing since sliced bread. Oh, wait, I actually saw many such posts. If we started doing everything that we saw on the Internet, the entire computing infrastructure of the universe wouldn't be enough to run the agents implementing what somebody posted somewhere.

I'm familiar with people pruning models. We have more than enough of those, no need to add yet another source of them. At least a pruned model is smaller than the original, so it makes sense to have it stored on disk.

I'm also familiar with people merging models, people repeating layers in models, etc. But it is one thing for someone to run calculations for weeks to find out which layers to duplicate, and quite another for some user to duplicate a bunch of randomly chosen layers.

But if one wanted to seriously support the feature of duplicating layers, one would do it on the inference engine level, and not by having a Kimi-generated script produce a larger model stored on disk that we need to load and keep in RAM/VRAM.
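As a toy illustration of the engine-level idea (not ik_llama.cpp code, and not something this PR implements), repeating a layer only needs an execution order that references the same weights more than once. Here layers are plain Python callables and `run_layers` is a hypothetical name; a real engine would presumably also need separate KV cache or recurrent state for each repeated occurrence.

```python
from typing import Callable, Sequence

Layer = Callable[[float], float]


def run_layers(layers: Sequence[Layer], order: Sequence[int], x: float) -> float:
    """Apply layers in the given order; an index may appear more than once.

    Repeating an index reuses the same layer object, so no weights are copied.
    """
    for idx in order:
        x = layers[idx](x)
    return x


# Three stand-in "layers"; real ones would be transformer blocks.
layers = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]

print(run_layers(layers, [0, 1, 2], 1.0))     # ((1 + 1) * 2) - 3 = 1.0
print(run_layers(layers, [0, 1, 1, 2], 1.0))  # layer 1 applied twice: 5.0
```

Trying a different set of repeated or skipped layers then only means changing the `order` list, with nothing rewritten on disk.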

FNsi · 2026-05-06T15:04:34Z

Indeed, I have tried that with Qwen-coder-next. After wasting a number of TB, I got a better result by deleting layers 4-7. Feel free to test: https://huggingface.co/Jahaz/Qwen-Coder-NX-73B

FNsi · 2026-05-06T15:21:18Z

Btw, the original reason for uploading this script was a possible future MTP addition. That is obviously not included in the script, but its padding, row-size handling, and offset logic could simply be reused.

So that's my first reason for keeping it as a draft;

Another reason is to let more people play with it: no fine-tuning and no original safetensors needed;

Third, if it is kept easy, people will produce stronger models, and I might be able to use them without finding the best combo myself.

Fourth point is, let's make something fun haha.

And the final reason: compared to letting it die on my disk like many other things, randomly deleted for no reason, uploading it here at least lets more people play with it.
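For readers unfamiliar with the padding and offset logic mentioned here, a minimal sketch of how tensor data offsets are typically laid out in a GGUF file: each tensor starts at a multiple of the file's alignment, which defaults to 32 bytes unless `general.alignment` says otherwise. `align_offset` and `layout_tensors` are hypothetical names, not functions from the script.

```python
from typing import List, Sequence

GGUF_DEFAULT_ALIGNMENT = 32  # used when general.alignment is not set


def align_offset(offset: int, alignment: int = GGUF_DEFAULT_ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment."""
    return offset + (alignment - offset % alignment) % alignment


def layout_tensors(sizes: Sequence[int],
                   alignment: int = GGUF_DEFAULT_ALIGNMENT) -> List[int]:
    """Return the data offset of each tensor, given tensor sizes in bytes."""
    offsets: List[int] = []
    cursor = 0
    for size in sizes:
        cursor = align_offset(cursor, alignment)  # pad before each tensor
        offsets.append(cursor)
        cursor += size
    return offsets


print(layout_tensors([100, 64, 10]))  # [0, 128, 192]
```

Any edit that adds or removes tensors, including a future MTP layer, has to recompute these offsets for every tensor that follows the change.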

FNsi · 2026-05-06T15:48:05Z

> this is the wrong way to do it. Using more vram is not the way to go: turboderp-org/exllamav3#174

> if one wanted to seriously support the feature of duplicating layers, one would do it on the inference engine level,

Woo, after thinking about it for a moment, I actually think that's a really good idea, if we can let the model itself decide to duplicate or ignore layers during inference!!!!

usrlocalben · 2026-05-06T23:13:52Z

> And the other day I saw a post claiming that TurboQuant was the best thing since sliced bread. Oh, wait, I actually saw many such posts. If we started doing everything that we saw on the

😅 I'm just giving some context, not defending random internet ideas

Ph0rk0z · 2026-05-07T00:16:29Z

I thought someone did benchmarks and the scores went up. It's not a new idea for sure, and finding which layers to replay sounds like the challenge.

FNsi · 2026-05-07T00:26:13Z

It's only a single draft script, guys. Just have fun playing with it 😂

Ph0rk0z · 2026-05-07T14:33:43Z

Yea I know but it would be a really neat experiment to do it virtually. Brute forcing weights on disk is painful.

FNsi · 2026-05-07T16:04:58Z

> Yea I know but it would be a really neat experiment to do it virtually. Brute forcing weights on disk is painful.

I agree. I also experimented with layer ~~swift~~ switching today; surprisingly, that doesn't easily break the model's reasoning either. Anyway, it really hurts the SSD 😂😂. If inference-level support were ready, it would be really great to autotune PPL with plenty of combos!

Add files via upload

8638bfa

FNsi changed the title ~~A GGUF editor, which can be used to duplicate or delete layers in Qwen3.5 / Qwen Coder Next~~ A GGUF editor, which can be used to duplicate or delete layers in Qwen3.5 / Qwen Coder Next or whatever but may not run in 1 shot. May 6, 2026
