Replies: 3 comments
-
Oh btw, I didn't decay the learning rate in my experiment because I wanted to continue training. Other configurations stay at their defaults.
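In PyTorch terms, "no decay" just means a schedule that leaves the base LR untouched; a minimal sketch, assuming a plain AdamW setup rather than the repo's actual optimizer code:

```python
import torch

# Illustrative sketch only (not the repo's actual schedule code): a "schedule"
# that always returns 1.0 keeps the learning rate flat, so a checkpoint can be
# trained further later instead of ending on a decayed-to-zero LR.
model = torch.nn.Linear(8, 8)                        # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda step: 1.0)

# A decaying run would instead use something like
# torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_iterations).
```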
-
I wasn't able to reproduce this in my own fair comparison; I suspect the comparison is iffy in some hard-to-tell way. Perhaps you can share the launch commands and/or diffs you're using.
-
Hi, there was indeed a baseline mismatch issue in the previous table, sorry for my oversight. This time, to ensure limited deviation from the current recipe, I forked from the latest code build and modified the code to fit my setup (GPU: 8x RTX 6000 Blackwell, 96 GB).
```bash
# changed line in speedrun
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_train -- --depth=12 --target-param-data-ratio=-1 --target-flops 2e18 --value-embed-layers=$VALUE_EMBED_LAYERS --run=$WANDB_RUN --window-pattern="L" --model-tag=$MODEL_TAG
```

Other than that, no default settings are touched, and the command above should reproduce the result. All runs use the same 1.9B training tokens / 3675 iterations. Below are the updated results.
Given the observation that deeper is potentially better at the current training FLOPs, I played a bit more with different settings, assigning more value embeddings to deeper layers than to shallow ones, and was able to squeeze out a little more performance. The best setting I have in hand is 1,3,7,9,10,11 (moves 5 -> 10 compared to the default), which has both better val bpb and better CORE than the default.
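To make the layer lists concrete, here is a tiny hypothetical helper showing how a `--value-embed-layers` string could map to layer indices; the flag name comes from the command above, but the parsing code is an illustrative assumption, not the actual `scripts.base_train` logic:

```python
# Hypothetical sketch: map a --value-embed-layers style string to layer indices.
def parse_value_embed_layers(spec: str) -> set[int]:
    """Parse a comma-separated layer list such as "1,3,7,9,11"."""
    return {int(x) for x in spec.split(",") if x.strip()}

# Layer lists exactly as written in this thread.
configs = {
    "default": "1,3,7,9,11",
    "deeper":  "1,3,7,9,10,11",
}
for name, spec in configs.items():
    print(name, sorted(parse_value_embed_layers(spec)))
```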
The improvement in bpb is admittedly marginal. But one thing I notice is that, in every case, the more budget you assign to deeper layers, the better the CORE metric becomes.
-
[Experiment] Where to insert “Value Embeddings” in nanoGPT-style Transformers? Deeper seems better (at this scale)
I’ve been playing with the “value embeddings” (VE) idea recently added in the nanoGPT ecosystem, and ran a coarse sweep to test where in the depth the model seems to prefer VE to be inserted: shallower vs deeper layers.
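For readers who haven't seen the VE idea: my mental model is a second learned token-embedding table whose output is added to the attention value path at a chosen subset of layers. The sketch below is a minimal illustration under that assumption, with made-up module names and a toy block; it is not the actual nanoGPT implementation, and causal masking and other details are omitted for brevity.

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """A second learned token-embedding table whose output gets mixed into the
    attention value path at selected layers (illustrative sketch only)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.emb(idx)  # (batch, seq, d_model)


class ToyBlockWithVE(nn.Module):
    """A toy attention block that adds the value embedding to V if enabled."""

    def __init__(self, d_model: int, n_head: int, use_ve: bool):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.use_ve = use_ve

    def forward(self, x: torch.Tensor, ve: torch.Tensor | None) -> torch.Tensor:
        h = self.norm(x)
        v = h + ve if (self.use_ve and ve is not None) else h
        out, _ = self.attn(h, h, v, need_weights=False)
        return x + out


# Usage: attach VE only at the layer indices being swept in this thread.
ve_layers = {1, 3, 7, 9, 11}                 # e.g. the "default" placement
vocab, d_model, n_head, depth = 50304, 768, 12, 12
tok_emb = nn.Embedding(vocab, d_model)
val_emb = ValueEmbedding(vocab, d_model)
blocks = nn.ModuleList(ToyBlockWithVE(d_model, n_head, i in ve_layers) for i in range(depth))

idx = torch.randint(0, vocab, (2, 16))       # dummy token ids
x, ve = tok_emb(idx), val_emb(idx)
for block in blocks:
    x = block(x, ve)
```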
Setup
Sweep design
I compared:
Key results (val bpb ↓, CORE eval ↑)
| Config | val bpb ↓ | CORE ↑ |
| --- | --- | --- |
| VE default (L1,3,7,9,11) | 0.8863 | 0.1495 |

Update: there is some issue with the baseline here, see the updated table in the comments.
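For reference, val bpb here is (as I understand it) the validation cross-entropy converted from nats per token to bits per byte:

$$
\text{bpb} \;=\; \frac{\bar{\mathcal{L}}_{\text{nats/token}}}{\ln 2} \cdot \frac{N_{\text{tokens}}}{N_{\text{bytes}}}
$$

where $N_{\text{tokens}}/N_{\text{bytes}}$ is the tokenizer's tokens-per-byte ratio on the validation text, which makes the metric comparable independent of the tokenizer.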
Main observation
Performance monotonically improves as VE is moved deeper, with the best result at L9–10.
This surprised me: I expected VE to help earlier representations more, but at least at this scale, deeper VE wins.
Hypotheses / interpretation
Questions
Has anyone observed a “VE prefers deeper layers” effect in your own experiments? @karpathy, does this persist when scaling up (more layers / larger models / more tokens), or do you think it is mostly a small-scale artifact?