Hitting a bit of a wall #85

osotsia · 2025-02-20T03:50:01Z

This speedrun is looking like a very good 10x cumulative improvement so far. But it seems to be hitting a bit of a wall.

I'm trying to think ahead. Is it correct to say that since AI performance depends on the log of compute, we'd have to hit a 100x cumulative improvement for the next step up? So 31 min -> 3 min -> 0.3 min? Because if so, that might require something truly radical.
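
To make the arithmetic in that question concrete, here is a minimal sketch. It assumes the framing above (each "step up" is another 10x cut in run time) and takes 31 min as the starting point; the function name `speedrun_milestones` is made up for illustration.

```python
def speedrun_milestones(start_minutes, factor=10.0, steps=2):
    """Return the run times reached by repeated `factor`-fold speedups."""
    times = [start_minutes]
    for _ in range(steps):
        times.append(times[-1] / factor)
    return times


milestones = speedrun_milestones(31.0)
for step, minutes in enumerate(milestones):
    # Cumulative speedup relative to the starting run time.
    cumulative = milestones[0] / minutes
    print(f"step {step}: {minutes:.2f} min ({cumulative:.0f}x cumulative)")
```

Under that assumption, the second step means going from about 3 min to about 0.3 min, which is 100x relative to the 31 min starting point.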

linux-leo · 2025-02-21T19:41:02Z

I agree. Knowing how much effort a substantial new record now takes, I'm struggling to find motivation.

I know suggestions are meaningless without someone doing the work, and every "radical" idea I've had looks like so much work that I'm scared of blowing through my budget several times over. But let me restate them, because I think they are worth pursuing:

Forgetting Transformer: Softmax Attention with a Forget Gate: https://openreview.net/forum?id=q2Lnyegkr8 (See: Speedrunning ideas discussion #23 (comment))
Exploring the Benefit of Activation Sparsity in Pre-training: https://arxiv.org/abs/2410.03440

I think the best option would be for someone to implement KV shifting as outlined in the Forgetting Transformer paper. It's (hopefully) a performance improvement by itself, and with it, all prerequisites for implementing the forgetting transformer are fulfilled. It also looks relatively simple: other papers have suggested KV shifting and even provided reference code: https://arxiv.org/pdf/2411.19574

5 replies

osotsia Feb 22, 2025
Author

I just did some quick KV shifting work. It was better at step 125 of training: val loss = 4.55 vs 4.65 for the current leader (RuleTweak). But RuleTweak quickly made up ground, and they ended at the same place. My observation is that the improvements in that Baichuan AI paper are not that big (judging by the number of steps saved at the same loss value, i.e. the horizontal gap in Fig. 4 and Fig. 6). Good instincts though, it's still the best experiment I've done so far. Will work on the rest and let you know if I find something.

linux-leo Feb 22, 2025

Nice work @osotsia! Two suggestions:

Applying QK norm after shifting, as was done in the forgetting transformer, might improve performance. Idk if RoPE should be applied before or after shifting, but might as well also put it after shifting so positional information isn't lost.
The Baichuan paper implements data-independent KV shifting, while RWKV and the forgetting transformer use data-dependent KV shifting, which should be even better; see Equation 14 and Appendix A of the Forgetting Transformer paper.
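
For anyone following along, the difference between the two variants can be sketched in plain Python. This only shows the general shape of the idea and is not the code in either paper or in any fork: the convex mix with the previous token, the zero vector used for the first position, and the names `shift_mix`, `data_independent_gates`, and `data_dependent_gates` are all assumptions made for illustration. A real implementation would use batched tensor ops and learn the weights per head.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shift_mix(seq, gates):
    """Mix each vector with its predecessor.

    out[t] = (1 - gates[t]) * seq[t] + gates[t] * seq[t - 1]

    seq   : list of T vectors (lists of floats), e.g. the keys of one head
    gates : list of T mixing weights in [0, 1]
    Assumption: the first position has no predecessor, so a zero vector is used.
    """
    prev = [0.0] * len(seq[0])
    out = []
    for vec, g in zip(seq, gates):
        out.append([(1.0 - g) * cur + g * old for cur, old in zip(vec, prev)])
        prev = vec
    return out


def data_independent_gates(num_tokens, alpha):
    """One learned scalar shared by every position."""
    return [alpha] * num_tokens


def data_dependent_gates(xs, w):
    """One gate per position, computed from that position's input vector."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for x in xs]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fixed = shift_mix(keys, data_independent_gates(len(keys), 0.5))
# Here the keys themselves stand in for the layer input, purely for the demo.
learned = shift_mix(keys, data_dependent_gates(keys, [0.0, 0.0]))
```

The same `shift_mix` would be applied to the values with their own weights. The only difference between the variants is where the gates come from, which is also why the data-dependent version costs an extra projection per token.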

osotsia Feb 23, 2025
Author

Did those suggestions. Same pattern in the val losses.

I think I'll go straight for the forget gate. The ablations in the Forgetting Transformer paper (Table 3) indicate it might have the biggest gains, and big gains tend to transfer better to different contexts. I'll pick it up next week.
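
As a reference point for what the forget gate changes, here is a minimal single-head sketch in plain Python. It assumes the usual description of the mechanism: each token t gets a scalar gate f_t in (0, 1), and the attention logit for query i and key j is shifted by the sum of log f over positions j+1 through i before the softmax. The function names are made up for illustration, and this is not an efficient or fused implementation.

```python
import math


def forget_gate_bias(log_f):
    """Return the T x T bias D with D[i][j] = sum(log_f[j+1 : i+1]) for j <= i.

    Entries with j > i are -inf, which also enforces the causal mask.
    """
    T = len(log_f)
    prefix = [0.0]  # prefix[k] = sum(log_f[:k])
    for value in log_f:
        prefix.append(prefix[-1] + value)
    return [
        [prefix[i + 1] - prefix[j + 1] if j <= i else -math.inf for j in range(T)]
        for i in range(T)
    ]


def gated_attention_weights(scores, log_f):
    """Row-wise softmax of scores[i][j] + D[i][j]."""
    weights = []
    for s_row, b_row in zip(scores, forget_gate_bias(log_f)):
        logits = [s + b for s, b in zip(s_row, b_row)]
        m = max(logits)  # the diagonal is always finite, so m is finite
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


flat_scores = [[0.0] * 3 for _ in range(3)]
decayed = gated_attention_weights(flat_scores, [math.log(0.5)] * 3)
plain = gated_attention_weights(flat_scores, [0.0] * 3)
```

With every gate equal to 1 (log f = 0) this reduces to ordinary causal softmax attention, and smaller gates down-weight older tokens more strongly.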

linux-leo Feb 23, 2025

Yeah, I was following the commits in your fork @osotsia. Great work. Of the multiple versions you committed, which ones did you test, and which had the lowest loss in the earlier training steps? I just want to know the best trade-off between the complexity of the implementation (and its overhead) and the validation loss.

I know the forgetting transformer is the biggest performance improvement, but combining it with KV shifting could allow removing RoPE, which would further reduce runtime, similar to how it was done in the Forgetting Transformer paper, just without the output norm and gate.

osotsia Feb 23, 2025
Author

Data-independent 09eba8b

Data-dependent shift 680e1de

Those are the working, tested ones (the other commits usually had something wrong with them). They performed basically the same in terms of early val loss reduction, but yeah, the data-dependent one was slower. You can just grab the CausalSelfAttention class if you want.
