I agree. Knowing that the effort required to get a substantial record is now so high, I'm struggling to find motivation. I know suggestions are meaningless without someone doing the work, and every "radical" idea I've had looks like so much work that I'm scared of blowing through my budget multiple times over. But let me restate them, because I think they are worth pursuing:
I think the best option would be for someone to implement KV shifting as outlined in the Forgetting Transformer paper. It's (hopefully) a performance improvement by itself, and with it all prerequisites for implementing the Forgetting Transformer are fulfilled. It also looks relatively simple: other papers have suggested KV shifting and even provided reference code: https://arxiv.org/pdf/2411.19574
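To make the idea concrete, here is a minimal NumPy sketch of what I understand KV shifting to mean: each key and value is blended with the key and value of the previous token before attention. This is not taken from either paper's code. The function names and the scalar mixing weights `alpha` and `beta` are my own assumptions; a real implementation would presumably learn the weights (possibly per head or per token) and live inside the PyTorch attention module.

```python
import numpy as np

def shift_tokens(x):
    """Shift a (seq_len, dim) array one step forward in time.

    Position t receives the row from position t-1; position 0 gets zeros.
    """
    shifted = np.zeros_like(x)
    shifted[1:] = x[:-1]
    return shifted

def kv_shift(k, v, alpha, beta):
    """Blend each key/value with its predecessor (hypothetical helper).

    alpha and beta are assumed scalar mixing weights in [0, 1];
    0 leaves the tensor unchanged, 1 replaces it with the shifted copy.
    """
    k_mixed = (1.0 - alpha) * k + alpha * shift_tokens(k)
    v_mixed = (1.0 - beta) * v + beta * shift_tokens(v)
    return k_mixed, v_mixed

# Toy example: 3 tokens, head dimension 2.
k = np.arange(6, dtype=np.float64).reshape(3, 2)
v = np.ones((3, 2))
k_mixed, v_mixed = kv_shift(k, v, alpha=0.5, beta=0.0)
```

Because the shift only looks backward, causality is preserved, and the extra cost is one shifted copy plus an elementwise blend per tensor.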
-
This speedrun is looking like a very good 10x cumulative improvement so far. But it seems to be hitting a bit of a wall.
I'm trying to think ahead. Is it correct to say that since AI performance depends on the log of compute, we'd have to hit a 100x cumulative improvement for the next step up? So 31 min -> 3 min -> 0.3 min? Because if so, that might require something truly radical.
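A quick sanity check of that arithmetic, assuming the premise holds and that progress is measured in log10 of the speedup factor (both are assumptions for illustration, not something measured on the speedrun). The 31-minute baseline is the figure used above; the helper name is made up.

```python
import math

def time_after_speedup(baseline_min, factor):
    """Wall-clock time in minutes after a cumulative speedup of `factor`."""
    return baseline_min / factor

baseline_min = 31.0
factors = [10, 100]  # cumulative speedups
times = [time_after_speedup(baseline_min, f) for f in factors]
log_steps = [math.log10(f) for f in factors]

for f, t, s in zip(factors, times, log_steps):
    print(f"{f}x cumulative: {t:.2f} min, log10 = {s:.0f}")
```

Under that assumption each equal step on the log scale costs another factor of 10, so the second step means going from roughly 3 minutes to roughly 0.3 minutes.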