Conversation

@adamreichold

Obviously not a fully fleshed-out implementation, but enough to run unit tests and ideally decide whether this is worth pursuing.

@adamreichold (Author)

I am also not sure how to integrate this into the existing benchmarks. Ideally, I would want to compare the SIMD variants against the fallback to determine whether this is worth it.

@BurntSushi (Owner)

Thanks! I'm not sure this is really within my bandwidth to maintain right now. A memchr8 without memchr{4,5,6,7} is pretty odd. So adding this seems likely to commit me to supporting those too. And having all of those seems like a poor user experience overall and adds maintenance burden. And because it will substantially increase the size of the crate, they will also need to be opt-in features. That in turn makes testing more annoying.

Moreover, I perceive these as very niche APIs. The needles likely need to be quite infrequent in the haystack for something like this to be worth it. Finally, it's not clear to me that it's worth adding this when one can just use aho-corasick. I'm guessing you found this to be faster, but I'm not clear on that.

@adamreichold (Author) commented Oct 13, 2025

> A memchr8 without memchr{4,5,6,7} is pretty odd.

Note that this is actually a memchr012345678, i.e. it supports any number of needles up to and including 8. Of course, naming things is hard. But I think this should mitigate a lot of the concerns about crate size, testing, etc.?
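
For concreteness, here is a rough sketch of the shape such a searcher takes, scalar reference only (names and exact signatures are illustrative, not the PR's actual code):

```rust
/// A sketch of a searcher for up to eight needles. Each 16-entry table maps
/// a nibble to a bitmask of the needles containing that nibble; a byte
/// matches iff some needle bit survives the AND of both lookups.
pub struct Eight {
    lo: [u8; 16],
    hi: [u8; 16],
}

impl Eight {
    /// Accepts anywhere from zero to eight needles, `None` beyond that.
    pub fn new(needles: &[u8]) -> Option<Eight> {
        if needles.len() > 8 {
            return None;
        }
        let (mut lo, mut hi) = ([0u8; 16], [0u8; 16]);
        for (i, &n) in needles.iter().enumerate() {
            lo[usize::from(n & 0xf)] |= 1u8 << i;
            hi[usize::from(n >> 4)] |= 1u8 << i;
        }
        Some(Eight { lo, hi })
    }

    /// Scalar reference implementation of the lookup.
    pub fn find(&self, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| {
            (self.lo[usize::from(b & 0xf)] & self.hi[usize::from(b >> 4)]) != 0
        })
    }
}
```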

> Finally, it's not clear to me that it's worth adding this when one can just use aho-corasick. I'm guessing you found this to be faster, but I'm not clear on that.

I did try aho-corasick over at robinson, and the additional machinery for supporting patterns longer than one byte did indeed have significant overhead. Admittedly, I did not directly compare them; it was just that aho-corasick did not improve upon the scalar implementation, whereas this code did.

> The needles likely need to be quite infrequent in the haystack for something like this to be worth it.

Could you expand on your reasoning here?

Considering that, according to Intel's intrinsics documentation, e.g. `_mm256_cmpeq_epi8` and `_mm256_shuffle_epi8` appear to have the same basic performance characteristics (1 cycle latency, 0.5 cycles-per-instruction throughput), I would actually wonder how this implementation of memchr8 compares to memchr3 when used with three needles.
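
The core of the comparison: memchr3 issues one cmpeq per needle, while the nibble-table approach pays a fixed price of a shift, two masks and two shuffles per chunk regardless of needle count. A minimal sketch of that classification step, assuming the `lo`/`hi` tables from above have been broadcast into both 128-bit lanes (again not the PR's actual code):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Returns a byte-wise mask that is non-zero wherever `chunk` contains any
/// of the (up to eight) needles encoded in the broadcast tables `lo`/`hi`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn classify(chunk: __m256i, lo: __m256i, hi: __m256i) -> __m256i {
    let nib_mask = _mm256_set1_epi8(0x0f);
    // Split each byte into its nibbles. The shift-and-mask for the high
    // nibble is overhead that a plain cmpeq against a needle does not pay.
    let lo_nibbles = _mm256_and_si256(chunk, nib_mask);
    let hi_nibbles = _mm256_and_si256(_mm256_srli_epi16::<4>(chunk), nib_mask);
    // One table lookup per nibble via shuffle, then AND the needle bitmasks:
    // a byte matches iff some needle agrees with it on both nibbles.
    let lo_hits = _mm256_shuffle_epi8(lo, lo_nibbles);
    let hi_hits = _mm256_shuffle_epi8(hi, hi_nibbles);
    _mm256_and_si256(lo_hits, hi_hits)
}
```

A caller would then compare the result against zero and movemask to locate the first hit, just as with the cmpeq-based variants.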

@adamreichold (Author)

> I would actually wonder how this implementation of memchr8 compares to memchr3 when used with three needles.

At least when the construction of the look-up tables is either constant-folded or amortised.

@BurntSushi (Owner)

> Could you expand on your reasoning here?

Processing more matches means more overhead. memchr has smaller overhead than memchr2. And so on. As you add more needles, the likelihood of matches generally increases. The regex crate makes a lot of heuristic assumptions around this for its literal optimizations.
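
Schematically (a hypothetical single-needle loop, not this crate's actual code): every hit falls out of the cheap 32-bytes-per-iteration path into the match-handling path, so the more frequent the needles, the less time is spent in the wide loop.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_avx2(haystack: &[u8], needle: u8) -> Option<usize> {
    let v = _mm256_set1_epi8(needle as i8);
    let mut i = 0;
    // Bulk path: one compare and one movemask per 32 bytes.
    while i + 32 <= haystack.len() {
        let chunk = _mm256_loadu_si256(haystack.as_ptr().add(i) as *const __m256i);
        let mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, v));
        if mask != 0 {
            // Match path: taken once per hit.
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32;
    }
    // Scalar tail for the final partial chunk.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```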

I did miss that this covers all needle variations. But I really just don't have the review bandwidth for something new like this. It needs to be considered holistically within the memchr API.

Moreover, I would definitely want to understand why this is better than aho-corasick's Teddy implementation. I think aho-corasick specializes for each needle length? So I would want to see if that could be made faster first.

> Considering that, according to Intel's intrinsics documentation, e.g. `_mm256_cmpeq_epi8` and `_mm256_shuffle_epi8` appear to have the same basic performance characteristics (1 cycle latency, 0.5 cycles-per-instruction throughput), I would actually wonder how this implementation of memchr8 compares to memchr3 when used with three needles.

Yeah that would be quite interesting! This crate uses rebar for benchmarks and I think you could probably bake them off within that framework.

@adamreichold (Author) commented Oct 13, 2025

> Yeah that would be quite interesting! This crate uses rebar for benchmarks and I think you could probably bake them off within that framework.

These are the results, where "eight" means three needles using the Eight instead of the Three implementation. I also directly called the AVX2 variants (in both cases) to enable constant-folding of the look-up tables for the slice-based variant.

```text
benchmark                        rust/memchr/memchr3  rust/memchr/memchr3/eight  rust/memchr/memchr3/fallback  rust/memchr/memchr3/naive
---------                        -------------------  -------------------------  ----------------------------  -------------------------
memchr/sherlock/common/huge3     629.6 MB/s (1.00x)   530.3 MB/s (1.19x)         208.6 MB/s (3.02x)            339.7 MB/s (1.85x)
memchr/sherlock/common/small3    666.6 MB/s (1.20x)   545.9 MB/s (1.46x)         695.1 MB/s (1.15x)            799.5 MB/s (1.00x)
memchr/sherlock/never/huge3      39.9 GB/s (1.00x)    31.6 GB/s (1.27x)          3.6 GB/s (11.23x)             1071.7 MB/s (38.17x)
memchr/sherlock/never/small3     12.4 GB/s (1.00x)    6.2 GB/s (2.00x)           2.9 GB/s (4.20x)              1018.1 MB/s (12.44x)
memchr/sherlock/never/tiny3      1605.0 MB/s (1.00x)  731.2 MB/s (2.20x)         1096.7 MB/s (1.46x)           731.2 MB/s (2.20x)
memchr/sherlock/never/empty3     40.00ns (1.00x)      60.00ns (1.50x)            40.00ns (1.00x)               40.00ns (1.00x)
memchr/sherlock/rare/huge3       26.8 GB/s (1.00x)    21.3 GB/s (1.26x)          3.3 GB/s (8.06x)              1060.9 MB/s (25.85x)
memchr/sherlock/rare/small3      8.7 GB/s (1.00x)     4.8 GB/s (1.83x)           2.7 GB/s (3.24x)              972.7 MB/s (9.17x)
memchr/sherlock/rare/tiny3       1096.7 MB/s (1.00x)  598.2 MB/s (1.83x)         940.1 MB/s (1.17x)            658.0 MB/s (1.67x)
memchr/sherlock/uncommon/huge3   2.0 GB/s (1.00x)     1944.3 MB/s (1.08x)        844.4 MB/s (2.49x)            770.2 MB/s (2.72x)
memchr/sherlock/uncommon/small3  2.6 GB/s (1.00x)     1972.7 MB/s (1.34x)        1809.3 MB/s (1.46x)           770.4 MB/s (3.43x)
memchr/sherlock/uncommon/tiny3   438.7 MB/s (1.50x)   299.1 MB/s (2.20x)         548.4 MB/s (1.20x)            658.0 MB/s (1.00x)
```

So my take would be that it is certainly slower, but it also comes close at times and usually lands in between the direct AVX2 and the fallback/SWAR versions.

The "empty" case is much slower though (40 versus 60 ns) which might still point at higher searcher construction cost. From counting intrinsics/instructions, I would guess the additional shifting and masking to get separate nibbles also adds extra costs, so it might look better against a hypothetical memchr4.

In any case,

> But I really just don't have the review bandwidth for something new like this.

so let's leave it at that. (I would have to maintain my bespoke version in any case, as I really need its "7+1" semantics - find any of `=/> \t\r\n` but also mark any `:` before that - which this API would not provide. Without doing both things in a single pass, I still end up slower than the scalar version for typical XML tag names.)
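
For illustration, a scalar single-pass version of those "7+1" semantics might look like this (the name and return shape are made up for the sketch):

```rust
/// Stop at the first of the seven delimiters while remembering the last `:`
/// seen before it, all in one pass over the haystack.
fn find_tag_name_end(haystack: &[u8]) -> (Option<usize>, Option<usize>) {
    let mut colon = None;
    for (i, &b) in haystack.iter().enumerate() {
        match b {
            b'=' | b'/' | b'>' | b' ' | b'\t' | b'\r' | b'\n' => {
                return (Some(i), colon);
            }
            b':' => colon = Some(i),
            _ => {}
        }
    }
    (None, colon)
}
```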

@BurntSushi (Owner)

Thanks for following up! That is an interesting data point.

I do loosely hope to expand memchr some day, but it will take time. And yeah, if you need a bespoke variant anyway, it's probably best to just stick with that.
