RFC: Add memchr8 based on candidate phase of Teddy #192
Conversation
I am also not sure how to integrate this into the existing benchmarks. Ideally, I would want to compare the SIMD variants against the fallback to determine whether this is worth it.
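(For a quick sanity check outside a real benchmark harness, a rough std-only timing loop like the following could compare the crate's SIMD-accelerated `memchr3` against a naive scalar fallback. The haystack shape, needles, and iteration count are arbitrary assumptions, and this is no substitute for proper benchmarking:)

```rust
// A rough, std-only timing sketch (not the crate's benchmark harness):
// compares memchr::memchr3 against a naive scalar fallback on one haystack.
use std::hint::black_box;
use std::time::Instant;

// Naive fallback: byte-at-a-time scan for any of three needles.
fn naive3(a: u8, b: u8, c: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&x| x == a || x == b || x == c)
}

fn main() {
    // 1 MiB haystack with no matches, i.e. a worst case for both searchers.
    let haystack = vec![b'z'; 1 << 20];
    let iters: u32 = 1_000;

    let start = Instant::now();
    for _ in 0..iters {
        black_box(memchr::memchr3(b'a', b'b', b'c', black_box(haystack.as_slice())));
    }
    println!("memchr3:  {:?} per call", start.elapsed() / iters);

    let start = Instant::now();
    for _ in 0..iters {
        black_box(naive3(b'a', b'b', b'c', black_box(haystack.as_slice())));
    }
    println!("fallback: {:?} per call", start.elapsed() / iters);
}
```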
Thanks! I'm not sure this is really within my bandwidth to maintain right now. A
Note that this is actually a
I did try this over at
Could you expand on your reasoning here? Considering that, according to Intel's intrinsics documentation, e.g.
At least as long as building the look-up tables is either constant-folded or amortised.
Processing more matches means more overhead. I did miss that this covers all needle variations. But I really just don't have the review bandwidth for something new like this. It needs to be considered holistically within the memchr API. Moreover, I would definitely want to understand why this is better than aho-corasick's Teddy implementation. I think aho-corasick specializes for each needle length? So I would want to see if that could be made faster first.
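(For reference, aho-corasick already exposes its Teddy-backed multi-pattern searcher through the `packed` module, so a bake-off against the proposal could start from something like the sketch below. The patterns and haystack are arbitrary examples, and the packed searcher only uses Teddy when the platform and pattern set qualify, falling back to Rabin-Karp otherwise:)

```rust
// A minimal sketch using aho-corasick's `packed` searcher (Teddy-backed on
// x86-64 when the pattern set qualifies). Dependency assumption: aho-corasick = "1".
use aho_corasick::packed::{Config, MatchKind};

fn main() {
    // Build a packed searcher over a handful of single-byte patterns.
    let searcher = Config::new()
        .match_kind(MatchKind::LeftmostFirst)
        .builder()
        .add("a")
        .add("b")
        .add("c")
        .build()
        .expect("packed searcher construction failed");

    let haystack = "xxxxbxxxx";
    if let Some(m) = searcher.find(haystack) {
        println!("first match at {}..{}", m.start(), m.end());
    }
}
```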
Yeah, that would be quite interesting! This crate uses rebar for benchmarks and I think you could probably bake them off within that framework.
These are the results where "eight" means three needles using the
So my take would be that it is certainly slower, but it also comes close at times and usually lands in between the direct AVX2 and the fallback/SWAR versions. The "empty" case is much slower though (40 versus 60 ns), which might still point at higher searcher construction cost. From counting intrinsics/instructions, I would guess the additional shifting and masking to get the separate nibbles also adds extra cost, so it might look better against a hypothetical
In any case,
so let's leave it at that. (I would have to maintain my bespoke version in any case, as I really need its "7+1" semantics - find any of
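(For readers unfamiliar with the candidate phase being discussed, here is a rough scalar sketch of the idea restricted to single-byte needles. The `Memchr8` name and API are illustrative assumptions, not the PR's code; the actual proposal would perform the two nibble-table lookups with PSHUFB/VPSHUFB over whole vector chunks instead of byte-at-a-time, which is where the extra shifting and masking mentioned above comes from:)

```rust
/// Illustrative sketch of the Teddy-style candidate phase for up to eight
/// single-byte needles. Not the PR's implementation.
struct Memchr8 {
    // For each low/high nibble value, a bitmask of which needles have it.
    lo_table: [u8; 16],
    hi_table: [u8; 16],
}

impl Memchr8 {
    fn new(needles: &[u8]) -> Memchr8 {
        assert!(needles.len() <= 8, "at most eight single-byte needles");
        let mut lo_table = [0u8; 16];
        let mut hi_table = [0u8; 16];
        for (i, &n) in needles.iter().enumerate() {
            lo_table[(n & 0x0F) as usize] |= 1u8 << i;
            hi_table[(n >> 4) as usize] |= 1u8 << i;
        }
        Memchr8 { lo_table, hi_table }
    }

    /// Returns the offset of the first byte equal to any needle.
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        for (offset, &b) in haystack.iter().enumerate() {
            let lo = self.lo_table[(b & 0x0F) as usize];
            let hi = self.hi_table[(b >> 4) as usize];
            // A needle's bit survives the AND only if both of its nibbles
            // match, i.e. the byte equals that needle exactly.
            if lo & hi != 0 {
                return Some(offset);
            }
        }
        None
    }
}
```

(Because the needles are single bytes, a bit surviving the AND of the two nibble masks already identifies an exact match, so no separate verification step is needed as in full multi-byte Teddy.)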
Thanks for following up! That is an interesting data point. I do loosely hope to expand
Obviously not a fully fleshed out implementation, but enough to run unit tests and ideally decide whether this is worth pursuing.
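(As a rough illustration of the kind of unit test meant here, the hypothetical `Memchr8` sketch above could be cross-checked against the crate's existing `memchr::memchr3` whenever only three needles are used. Again just a sketch, not the PR's actual test suite:)

```rust
// Sketch of a cross-check test: the hypothetical `Memchr8` from the earlier
// sketch should agree with memchr::memchr3 for any three-needle search.
#[cfg(test)]
mod tests {
    use super::Memchr8;

    #[test]
    fn agrees_with_memchr3_on_three_needles() {
        let haystack = b"xxxx-xxxbxxaxxcxx";
        let searcher = Memchr8::new(&[b'a', b'b', b'c']);
        assert_eq!(
            searcher.find(haystack),
            memchr::memchr3(b'a', b'b', b'c', haystack)
        );
    }
}
```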