
A GGUF editor, which can be use to duplicate or delete layers in Qwen3.5 / Qwen Coder Next or whatever but may not run in 1 shot. #1746

Draft
FNsi wants to merge 1 commit into ikawrakow:main from FNsi:Pyscript

Conversation

@FNsi

@FNsi FNsi commented May 6, 2026


Currently I think it works for Q8_0, Q6_0, IQ4KS, IQ4KSS, f16, f32, and IQ2KS tensors. Other quant types may not be supported yet, but support can easily be added based on the current code (it mostly comes down to row size).
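To illustrate what "row size related" means here, the following is a minimal sketch, not code from the script. It assumes the standard ggml layouts for F32, F16, and Q8_0 only; the names `BLOCK_LAYOUT`, `row_size_bytes`, and `tensor_size_bytes` are hypothetical. The other quant types listed above are deliberately left out because their layouts are not described in this thread.

```python
# Hypothetical helper, not taken from gguf_layer_editor.py.
# Maps a tensor type to (elements per block, bytes per block).
BLOCK_LAYOUT = {
    "F32": (1, 4),     # one 4-byte float per element
    "F16": (1, 2),     # one 2-byte half float per element
    "Q8_0": (32, 34),  # 32 int8 values plus one 2-byte f16 scale
}


def row_size_bytes(type_name: str, n_cols: int) -> int:
    """Bytes needed to store one row of n_cols elements."""
    block_elems, block_bytes = BLOCK_LAYOUT[type_name]
    if n_cols % block_elems != 0:
        raise ValueError(
            f"{n_cols} columns is not a multiple of block size {block_elems}"
        )
    return (n_cols // block_elems) * block_bytes


def tensor_size_bytes(type_name: str, shape: tuple) -> int:
    """Total bytes for a tensor whose first dimension is the row length."""
    n_rows = 1
    for dim in shape[1:]:
        n_rows *= dim
    return n_rows * row_size_bytes(type_name, shape[0])


if __name__ == "__main__":
    print(row_size_bytes("Q8_0", 4096))          # 128 blocks * 34 bytes = 4352
    print(tensor_size_bytes("F16", (4096, 2)))   # 2 rows * 8192 bytes = 16384
```

Supporting another quant type would then amount to adding its block layout, plus any extra per-row data the type carries, which this sketch does not model.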

AI usage: this script was made with Kimi k2.6.


The text below was written by Kimi.


Usage

Duplicate Layers

Copy layers 7, 8, 9, 10 and insert them after layer 10:

python gguf_layer_editor.py input.gguf output.gguf --duplicate 7 8 9 10

Multiple duplication ranges in one shot:

python gguf_layer_editor.py input.gguf output.gguf \
  --duplicate 7 8 9 10 \
  --duplicate 15 16 17 18 \
  --duplicate 30 31 32 33

Delete Layers

Remove layers 4, 5, 6, 7:

python gguf_layer_editor.py input.gguf output.gguf --delete 4 5 6 7

Note: For hybrid architectures, deletions must preserve the recurrent/full-attention pattern. The script validates this and aborts with a clear error if the pattern would break. Generally, delete multiples of 4 consecutive layers.

Combined Operations

Delete some layers and duplicate others:

python gguf_layer_editor.py input.gguf output.gguf \
  --delete 4 5 6 7 \
  --duplicate 30 31 32 33
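The effect of the commands above on layer order can be sketched as follows. This is an illustration only, under two assumptions that the description does not confirm: all indices refer to layers of the original model, and each duplicated group is inserted right after its last layer. The function name `plan_layer_order` is hypothetical.

```python
def plan_layer_order(n_layers, duplicate=(), delete=()):
    """Return the source layer index for every output position.

    duplicate and delete are sequences of index groups, one group per
    --duplicate / --delete flag. Assumption: indices always refer to the
    original model, and a duplicated group is inserted after its last layer.
    """
    deleted = {idx for group in delete for idx in group}

    # Map "last layer of a group" -> layers to repeat right after it.
    insert_after = {}
    for group in duplicate:
        insert_after.setdefault(group[-1], []).extend(group)

    order = []
    for idx in range(n_layers):
        if idx not in deleted:
            order.append(idx)
        for src in insert_after.get(idx, []):
            if src not in deleted:
                order.append(src)
    return order


if __name__ == "__main__":
    # --duplicate 7 8 9 10 on a 12-layer model
    print(plan_layer_order(12, duplicate=[[7, 8, 9, 10]]))
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 7, 8, 9, 10, 11]

    # --delete 4 5 6 7 on a 10-layer model
    print(plan_layer_order(10, delete=[[4, 5, 6, 7]]))
    # [0, 1, 2, 3, 8, 9]
```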

Verbose Mode

Add --verbose for detailed logging:

python gguf_layer_editor.py input.gguf output.gguf --duplicate 7 8 9 10 --verbose

Architecture Notes

These models alternate between two layer types:

  • Recurrent layers (linear attention): Have attn_qkv and ssm_* tensors
  • Full-attention layers: Have attn_q, attn_k, attn_v tensors

The pattern is determined by (layer_idx + 1) % full_attention_interval:

  • == 0: full-attention layer
  • != 0: recurrent layer

When duplicating or deleting, the script ensures layers end up at positions with matching types. Incompatible layers are automatically skipped with a warning.
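The position rule above can be sketched as a small check. This is a hedged illustration, not the script's actual validation code: the default `full_attention_interval=4` is an assumption inferred from the "multiples of 4" advice, and `layer_type` and `mismatched_positions` are hypothetical names.

```python
def layer_type(layer_idx, full_attention_interval=4):
    """Classify a position using (layer_idx + 1) % full_attention_interval."""
    if (layer_idx + 1) % full_attention_interval == 0:
        return "full"
    return "recurrent"


def mismatched_positions(order, full_attention_interval=4):
    """Find layers that would land on a position of the other type.

    order[new_pos] is the original index of the layer placed at new_pos.
    Returns (new_pos, source_idx) pairs whose types disagree.
    """
    return [
        (new_pos, src)
        for new_pos, src in enumerate(order)
        if layer_type(new_pos, full_attention_interval)
        != layer_type(src, full_attention_interval)
    ]


if __name__ == "__main__":
    # Deleting layers 4..7 from 12 layers keeps the pattern intact.
    print(mismatched_positions([0, 1, 2, 3, 8, 9, 10, 11]))
    # []

    # Deleting only layers 4..6 shifts later layers onto the wrong type.
    print(mismatched_positions([0, 1, 2, 3, 7, 8, 9, 10, 11]))
    # [(4, 7), (7, 10), (8, 11)]
```

With an interval of 4, removing or inserting a block of 4 consecutive layers shifts every later layer by a multiple of the interval, which is why such edits keep each layer on a position of its own type.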


Basically, I think we can build on this and, after some editing, add an MTP layer to an existing GGUF file.

@FNsi FNsi changed the title from "A GGUF editor, which can be use to duplicate or delete layers in Qwen3.5 / Qwen Coder Next" to "A GGUF editor, which can be use to duplicate or delete layers in Qwen3.5 / Qwen Coder Next or whatever but may not run in 1 shot." on May 6, 2026
@FNsi

FNsi commented May 6, 2026

Author

And... as someone who mostly uses GitHub on mobile...

Feel free to edit anything without asking.

@ikawrakow

Owner

Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?

@FNsi

FNsi commented May 6, 2026

Author

Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?

Well, I like that idea, but Kimi isn't on my phone, so...

If a model gets stubborn due to overtraining, deleting some earlier layers makes it more thoughtful;

If we think a model's reasoning is not strong enough, simply duplicating middle or later layers could help;

Some evidence? As far as I remember, there were some tests related to rys repetition that supported the idea of duplicating layers;

Or maybe we don't need that much justification, and it's just a new toy to play with. 😁

@usrlocalben

usrlocalben commented May 6, 2026

Contributor

Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?

Try this Modern LLM Hacking and hints of a Universal Language

tl;dr repeating certain layers in Qwen-28B makes it smarter. with some analysis one can choose an optimal set of layers to repeat. with some help from the engine, one could repeat those layers without addl. VRAM area.

@Ph0rk0z

Ph0rk0z commented May 6, 2026


Doubling the middle layers is actually a viable way of improving models, according to papers and other sources. But this is the wrong way to do it. Using more vram is not the way to go: turboderp-org/exllamav3#174

@ikawrakow

Owner

Perhaps Kimi can start the PR description by explaining to us the utility of duplicating and removing layers?

Try this Modern LLM Hacking and hints of a Universal Language

tl;dr repeating certain layers in Qwen-28B makes it smarter. with some analysis one can choose an optimal set of layers to repeat. with some help from the engine, one could repeat those layers without addl. VRAM area.

And the other day I saw a post claiming that TurboQuant was the best thing since sliced bread. Oh, wait, I actually saw many such posts. If we started doing everything that we saw on the Internet, the entire computing infrastructure of the universe wouldn't be enough to run the agents implementing what somebody posted somewhere.

I'm familiar with people pruning models. We have more than enough of those, no need to add yet another source of them. At least a pruned model is smaller than the original, so it makes sense to have it stored on disk.

I'm also familiar with people merging models, people repeating layers in models, etc. But it is one thing for someone to run calculations for weeks to find out which layers to duplicate, and quite another for some user to duplicate a bunch of randomly chosen layers.

But if one wanted to seriously support the feature of duplicating layers, one would do it on the inference engine level, and not by having a Kimi generated script produce a larger model stored on disk that we need to load and keep in RAM/VRAM.
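As a rough illustration of what engine-level duplication could look like, here is a toy sketch. It is not how ik_llama.cpp works and not a proposed API: the layers are stand-in functions on a single number, and `run_schedule` is a hypothetical name. The point is only that a repeated layer is a repeated index into the same weights, so nothing larger has to be written to disk.

```python
from typing import Callable, List, Sequence

# Stand-in for a transformer layer: any function from activations to activations.
Layer = Callable[[float], float]


def run_schedule(layers: List[Layer], schedule: Sequence[int], x: float) -> float:
    """Apply layers in the order given by schedule.

    An index may appear more than once (duplication) or not at all
    (deletion); every occurrence reuses the same layer object.
    """
    for idx in schedule:
        x = layers[idx](x)
    return x


if __name__ == "__main__":
    layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

    print(run_schedule(layers, [0, 1, 2], 1.0))     # original order: 1.0
    print(run_schedule(layers, [0, 1, 1, 2], 1.0))  # layer 1 repeated: 5.0
    print(run_schedule(layers, [0, 2], 1.0))        # layer 1 skipped: -1.0
```

In a real engine, each scheduled step would presumably still need its own KV cache or recurrent state, since a repeated layer sees different inputs on each pass; only the weights would be shared.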

@FNsi

FNsi commented May 6, 2026

Author

Indeed, I have tried that with Qwen-coder-next. After wasting a number of TB, I got a better result by deleting layers 4-7; feel free to test https://huggingface.co/Jahaz/Qwen-Coder-NX-73B

@FNsi

FNsi commented May 6, 2026

Author

Btw, the original idea behind uploading this script was to make it possible to add MTP in the future. That obviously isn't included in the script, but the padding, row-size handling, and offset logic could simply be reused.
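For readers unfamiliar with the padding and offset logic mentioned here, a minimal sketch follows. It is not the script's code: it only assumes the GGUF convention that tensor data offsets are multiples of the file's alignment (32 bytes by default, overridable through the `general.alignment` metadata key). The names `align_up` and `assign_offsets` are hypothetical.

```python
GGUF_DEFAULT_ALIGNMENT = 32  # used when general.alignment is not set


def align_up(offset: int, alignment: int = GGUF_DEFAULT_ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def assign_offsets(tensor_sizes, alignment: int = GGUF_DEFAULT_ALIGNMENT):
    """Lay out tensors back to back with padding between them.

    tensor_sizes: byte sizes in file order.
    Returns (offsets, total_bytes), with offsets relative to the start
    of the tensor data section.
    """
    offsets = []
    cursor = 0
    for size in tensor_sizes:
        offsets.append(cursor)
        cursor = align_up(cursor + size, alignment)
    return offsets, cursor


if __name__ == "__main__":
    # Three tensors of 34, 64 and 10 bytes.
    print(assign_offsets([34, 64, 10]))
    # ([0, 64, 128], 160)
```

Inserting or removing tensors, whether duplicated layers or a new MTP block, changes every later offset, so the whole table would need to be recomputed in this way before rewriting the file.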

So that's the first reason for me to keep it a draft;

Another point is to let more people play with it: no fine-tuning and no original safetensors needed;

Third point: if it's kept easy, people will produce stronger models, and I might be able to use them without finding the best combo myself.

Fourth point: let's make something fun haha.

But the final point is that, compared to letting it die on my disk like many other things that got randomly deleted for no reason, uploading it here at least lets more people play with it.

@FNsi

FNsi commented May 6, 2026

Author

this is the wrong way to do it. Using more vram is not the way to go: turboderp-org/exllamav3#174

if one wanted to seriously support the feature of duplicating layers, one would do it on the inference engine level,

Woo, after a moment I actually think that's a really good idea, if we can let the model itself decide to duplicate or skip layers during inference!!!!

@usrlocalben

usrlocalben commented May 6, 2026

Contributor

And the other day I saw a post claiming that TurboQuant was the best thing since sliced bread. Oh, wait, I actually saw many such posts. If we started doing everything that we saw on the

😅 I'm just giving some context, not defending random internet ideas

@Ph0rk0z

Ph0rk0z commented May 7, 2026


I thought someone did benchmarks and the scores went up. It's not a new idea for sure, and finding which layers to replay sounds like the challenge.

@FNsi

FNsi commented May 7, 2026

Author

It's only a single draft script, guys. Just have fun playing with it 😂

@Ph0rk0z

Ph0rk0z commented May 7, 2026


Yea I know but it would be a really neat experiment to do it virtually. Brute forcing weights on disk is painful.

@FNsi

FNsi commented May 7, 2026

Author

Yea I know but it would be a really neat experiment to do it virtually. Brute forcing weights on disk is painful.

I agree. I also experimented with layer switching today; surprisingly, that doesn't easily break the model's reasoning either. Anyway, it really hurts the SSD 😂😂. If inference-level support is ready, it would be really great to autotune PPL with plenty of combos!


4 participants