Shuffle less for SSE4.1 #137
Open
TobiSchluter wants to merge 3 commits into vitaut:main from
In the current code, we shuffle to reverse the byte order and we shuffle to shift the bits. This patch combines the two. It also replaces a slow `memmove` for the decimal digits with another shuffle + store, so that the data is never re-read from memory.

This is my first AI-assisted patch to zmij, and the AI's main task was to figure out the special cases when moving the writing of the last digit behind the shuffles. Previously, `last_digit` was only sometimes part of the memmove and could thus end up in a different position. The AI suggested keeping a table of where it goes, and so I did that. A table-free approach made the compiler add a branch and lose time.
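The table idea can be sketched like this (the names and the table's contents are purely illustrative, not the actual zmij data): a small lookup maps each layout case directly to the write offset of the final digit, so the store stays unconditional.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical position table: for each layout case, the offset at
// which the final digit is stored. Values here are made up.
constexpr uint8_t last_digit_pos[4] = {15, 16, 17, 17};

// Branch-free: one table load and one unconditional store, instead of
// branching on whether the digit lands inside or past the moved block.
inline void write_last_digit(char* buf, unsigned layout_case, char digit) {
  buf[last_digit_pos[layout_case & 3]] = digit;
}
```

A branchy equivalent ("if the digit falls past the block, store one position later") computes the same offset but, as noted above, made the compiler emit a jump and lose time.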
The way I avoided growing the table for the various bswap + shift variations (and thus messing up alignments or introducing padding) is quite simple. What we want to do is reverse the order and drop the first N digits, where N depends on whether we are handling the integral part or the decimals, and on the exponent. We don't care what ends up past the actual digits, so it doesn't matter what the last few entries of the shuffle mask are: we keep the existing bswap shuffle table, read it at an offset, and ignore whatever the trailing entries happen to be. This only works because we reverse at the same time, so it cannot be carried over to the NEON code. The additional bytes are loaded from the same struct, so it's not uninitialized memory and ASAN is happy, but I'm not sure what the standard says about this; I don't think the SIMD memory access is in its scope. BTW, I changed the type of `bswap` to `char *` in order to get simpler index algebra and fewer casts.

Finally, in order to keep everything branch-free, in the case of `dec_exp < 0`, where no decimal point is inserted between the digits, the identical shuffle + store operation is simply repeated. This turned out to be measurably faster than conditionally jumping past it. Note that the zmij/canada benchmark, given Canada's geography, doesn't contain any numbers < 1, so I created a separate benchmark (not included) that uses random fixed numbers. To my surprise it was actually faster than the processing of the Canada data. A possible explanation is that it triggered the `last_digit` handling far less often, but I didn't check this in detail. On my Zen 4 this saves 5%; on my Tiger Lake it is performance-neutral.
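The offset-read trick can be illustrated with a scalar emulation of `pshufb` (`_mm_shuffle_epi8`) index selection; the struct layout and names are a stand-in for the real tables, not the actual zmij code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical layout: the 16-byte bswap mask is followed by further
// initialized members of the same struct, so reading it at an offset
// never touches uninitialized memory (mirroring the PR's reasoning).
struct shuffle_tables {
  uint8_t bswap[16];
  uint8_t tail[16];  // whatever comes next; its values are never used
};
static const shuffle_tables tables = {
    {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}, {}};

// Scalar emulation of pshufb's index selection.
inline void shuffle16(uint8_t* out, const uint8_t* src, const uint8_t* mask) {
  for (int i = 0; i < 16; ++i) out[i] = src[mask[i] & 15];
}

// Reverse 16 bytes and drop the first n after reversal, using the one
// bswap table read at offset n. Mask entries read past bswap[] are
// don't-cares: they only affect output bytes beyond the real digits.
inline void reverse_and_drop(uint8_t* out, const uint8_t* digits, int n) {
  shuffle16(out, digits, tables.bswap + n);
}
```

Because `bswap[n + i] = 15 - n - i` for the in-range entries, reading the descending table at offset `n` yields exactly the combined "reverse, then shift left by `n`" mask, which is why no per-`n` table is needed.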
I also asked Claude to try the same thing for NEON, but none of the variations it tried came out faster than what is there right now.
PS: I just saw that I didn't clean this up fully, hence the non-squashed three-patch series with one unrelated change that I didn't mean to commit as part of this: I moved the `alignas(64)` annotation from the `static_data` declaration to the `struct data` declaration. The idea is that this way all the pointers we use remember the alignment. It turns out the assembly with and without this change is the same on x64 (https://godbolt.org/z/Mxz1fbKjY) and ARM64 (https://godbolt.org/z/ecnPzTWhK), but I think it's cleaner, so I'm not going to remove it right now even though it is unrelated.
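The distinction can be shown in a minimal sketch (the struct members here are made up; only the placement of `alignas` matters): on the type, the alignment becomes a property of every object and every pointer to it; on a single declaration, it covers only that one object.

```cpp
// alignas on the type: every `data` object, and anything the compiler
// knows through a `data*`, is 64-byte aligned.
struct alignas(64) data {
  char digits[64];
};
static const data static_data{};

// alignas on the declaration only: this object is aligned, but the
// type (and plain pointers to it) carry no alignment guarantee.
struct data_unannotated {
  char digits[64];
};
alignas(64) static const data_unannotated static_data_unannotated{};

static_assert(alignof(data) == 64, "alignment is part of the type");
static_assert(alignof(data_unannotated) == 1,
              "the type alone promises nothing");
```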