Shuffle less for SSE4.1 #137

Open
TobiSchluter wants to merge 3 commits into vitaut:main from TobiSchluter:shuffle_less

Conversation

@TobiSchluter (Contributor) commented May 7, 2026

In the current code, we shuffle once to reverse the byte order and shuffle again to shift the bits. This patch combines the two into a single shuffle. It also replaces a slow memmove for the decimal digits with another shuffle + store, so the data is never re-read from memory.

This is my first AI-assisted patch to zmij. The AI's main task was to figure out the special cases when moving the write of the last digit behind the shuffles: previously, the last digit was only sometimes part of the memmove and could thus end up in a different position. The AI suggested keeping a table of where it goes, so I did that. A table-free approach made the compiler add a branch and lost time.

The way I avoided increasing the table size for the various bswap + shift variations, and thus messing up alignments or introducing padding, is quite simple. What we want is to reverse the order and drop the first N digits, where N depends on whether we are handling the integral part or the decimals, and on the exponent. We don't care what ends up past the actual digits, so it doesn't matter what the last few entries of the shuffle mask are: we just keep the same bswap shuffle table, read it at an offset, and ignore what the trailing entries happen to be. This only works because we reverse at the same time, so it cannot be carried over to the NEON code.

The additional bytes are loaded from the same struct, so this is not a read of uninitialized memory and ASAN is happy, but I'm not sure what the standard says about it; I don't think the SIMD memory access is in its scope. BTW, I changed the type of bswap to char * to get simpler index algebra and fewer casts.

Finally, to keep everything branch-free, in the dec_exp < 0 case, where no decimal point is inserted between the digits, the identical shuffle + store operation is simply repeated. This turned out to be measurably faster than conditionally jumping past it. Note that the zmij/canada benchmark, given Canada's geography, doesn't contain any numbers < 1, so I created a separate benchmark (not included) that uses random fixed numbers. To my surprise it was actually faster than processing the Canada data. A possible explanation is that it triggers the last-digit path far less often, but I didn't check this in detail.

On my Zen 4 this saves 5%; on my Tiger Lake it is performance-neutral.

I also asked Claude to try the same thing for NEON, but none of the variations it tried came out faster than what is there right now.

P.S. I just saw that I didn't clean this up fully, hence the non-squashed three-patch series with one unrelated change that I didn't mean to commit as part of this: I moved the alignas(64) annotation from the static_data declaration to the struct data declaration, the idea being that this way all the pointers we use remember the alignment. It turns out the assembly with and without this change is the same on x64 (https://godbolt.org/z/Mxz1fbKjY) and ARM64 (https://godbolt.org/z/ecnPzTWhK), but I think it's cleaner, so I'm not going to remove it right now even though it is unrelated.
