Replies: 3 comments
-
Oh btw, I didn't decay the learning rate in my experiment because I wanted to continue training. Other configurations stay at their defaults.
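In PyTorch terms, "no decay" just means a schedule that leaves the base LR untouched; a minimal sketch, assuming a plain AdamW setup rather than the repo's actual optimizer code:

```python
import torch

# Illustrative sketch only (not the repo's actual schedule code): a "schedule"
# that always returns 1.0 keeps the learning rate flat, so a checkpoint can be
# trained further later instead of ending on a decayed-to-zero LR.
model = torch.nn.Linear(8, 8)                        # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda step: 1.0)

# A decaying run would instead use something like
# torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_iterations).
```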
-
I wasn't able to reproduce this in my own fair comparison; I suspect the comparison is iffy in some hard-to-tell way. Perhaps you can share the launch commands and/or diffs you're using.
-
Hi, there was indeed a baseline mismatch issue in the previous table, sorry for my oversight. This time, to ensure limited deviation from the current recipe, I forked from the latest code build and modified the code to fit my setup (GPU: 8x RTX 6000 Blackwell, 96 GB).
```bash
# changed line in speedrun
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_train -- --depth=12 --target-param-data-ratio=-1 --target-flops 2e18 --value-embed-layers=$VALUE_EMBED_LAYERS --run=$WANDB_RUN --window-pattern="L" --model-tag=$MODEL_TAG
```

Other than that, no default settings are touched, and the command above should reproduce the result. All runs use the same 1.9B training tokens / 3675 iterations. Below are the updated results.
Given the observation that deeper is potentially better at the current training FLOPs, I played a bit more with different settings, assigning more value embeddings to deeper layers than to shallow ones, and was able to squeeze out a little more performance. The best setting I have in hand is 1,3,7,9,10,11 (moves 5 -> 10 compared to the default), which has both better val bpb and better CORE than the default.
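To make the layer lists concrete, here is a tiny hypothetical helper showing how a `--value-embed-layers` string could map to layer indices; the flag name comes from the command above, but the parsing code is an illustrative assumption, not the actual `scripts.base_train` logic:

```python
# Hypothetical sketch: map a --value-embed-layers style string to layer indices.
def parse_value_embed_layers(spec: str) -> set[int]:
    """Parse a comma-separated layer list such as "1,3,7,9,11"."""
    return {int(x) for x in spec.split(",") if x.strip()}

# Layer lists exactly as written in this thread.
configs = {
    "default": "1,3,7,9,11",
    "deeper":  "1,3,7,9,10,11",
}
for name, spec in configs.items():
    print(name, sorted(parse_value_embed_layers(spec)))
```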
The improvement in bpb is admittedly marginal. But one thing I notice is that, in every case, the more budget you assign to deeper layers, the better the CORE metric becomes.
-
[Experiment] Where to insert “Value Embeddings” in nanoGPT-style Transformers? Deeper seems better (at this scale)
I’ve been playing with the “value embeddings” (VE) idea recently added in the nanoGPT ecosystem, and ran a coarse sweep to test where in the depth the model seems to prefer VE to be inserted: shallower vs deeper layers.
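For readers who haven't seen the VE idea: my mental model is a second learned token-embedding table whose output is added to the attention value path at a chosen subset of layers. The sketch below is a minimal illustration under that assumption, with made-up module names and a toy block; it is not the actual nanoGPT implementation, and causal masking and other details are omitted for brevity.

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """A second learned token-embedding table whose output gets mixed into the
    attention value path at selected layers (illustrative sketch only)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.emb(idx)  # (batch, seq, d_model)


class ToyBlockWithVE(nn.Module):
    """A toy attention block that adds the value embedding to V if enabled."""

    def __init__(self, d_model: int, n_head: int, use_ve: bool):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.use_ve = use_ve

    def forward(self, x: torch.Tensor, ve: torch.Tensor | None) -> torch.Tensor:
        h = self.norm(x)
        v = h + ve if (self.use_ve and ve is not None) else h
        out, _ = self.attn(h, h, v, need_weights=False)
        return x + out


# Usage: attach VE only at the layer indices being swept in this thread.
ve_layers = {1, 3, 7, 9, 11}                 # e.g. the "default" placement
vocab, d_model, n_head, depth = 50304, 768, 12, 12
tok_emb = nn.Embedding(vocab, d_model)
val_emb = ValueEmbedding(vocab, d_model)
blocks = nn.ModuleList(ToyBlockWithVE(d_model, n_head, i in ve_layers) for i in range(depth))

idx = torch.randint(0, vocab, (2, 16))       # dummy token ids
x, ve = tok_emb(idx), val_emb(idx)
for block in blocks:
    x = block(x, ve)
```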
Setup
Sweep design
I compared:
Key results (val bpb ↓, CORE eval ↑)
| Config | val bpb ↓ | CORE ↑ |
| --- | --- | --- |
| VE default (L1,3,7,9,11) | 0.8863 | 0.1495 |

Update: there is some issue with the baseline here, see the updated table in the comments.
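For reference, val bpb here is (as I understand it) the validation cross-entropy converted from nats per token to bits per byte:

$$
\text{bpb} \;=\; \frac{\bar{\mathcal{L}}_{\text{nats/token}}}{\ln 2} \cdot \frac{N_{\text{tokens}}}{N_{\text{bytes}}}
$$

where $N_{\text{tokens}}/N_{\text{bytes}}$ is the tokenizer's tokens-per-byte ratio on the validation text, which makes the metric comparable independent of the tokenizer.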
Main observation
Performance monotonically improves as VE is moved deeper, with the best result at L9–10.
This surprised me: I expected VE to help earlier representations more, but at least at this scale, deeper VE wins.
Hypotheses / interpretation
Questions
Has anyone observed a “VE prefers deeper layers” effect in your own experiments? @karpathy, does this persist when scaling up (more layers / larger models / more tokens), or do you think it is mostly a small-scale artifact?