Description
I've been working on understanding the fast-path literal optimisation which recently trickled through to the cloudflare fork:
Lines 188 to 200 in 35ab598
I'm curious if you tried putting that logic at the end of the handler for the first literal in the main loop (with necessary adjustments):
Lines 204 to 209 in 35ab598
Because it should mean you can eliminate these refills, right?
Lines 217 to 219 in 35ab598
Lines 235 to 238 in 35ab598
That is, at the start of the loop you have at least 56 bits available. If you bite off a couple of fast-path literals before getting into the conventional length/distance decode case then you might be down to something like 34 bits. That means you can no longer guarantee 15 + 5 + 15 + 13 bits avialible, which makes extra top-ups necessary.
I might be wrong about that because there are gotos in there to confuse me. But supposing I didn't mess up...
I figure if you try to scoop up extra literals after the first one, then that's up to 15 bits you've spent (because you might have used a trampoline table entry path for the first literal), leaving 41 bits guaranteed so you can keep sipping up literals at least 41/state->lenbits
times. I think that's actually three more literals, isn't it?
I think this path is a branch prediction stressor, anyway, so it's a lot of cross-platform research which I'm not able to do to back up my "helpful suggestion".