Conversation

@adamreichold

Obviously not a fully fleshed-out implementation, but enough to run unit tests and ideally decide whether this is worth pursuing.

@adamreichold (Author)

I am also not sure how to integrate this into the existing benchmarks. Ideally, I would want to compare the SIMD variants against the fallback to determine whether this is worth it.

@BurntSushi (Owner)

Thanks! I'm not sure this is really within my bandwidth to maintain right now. A memchr8 without memchr{4,5,6,7} is pretty odd. So adding this seems likely to commit me to supporting those too. And having all of those seems like a poor user experience overall and adds maintenance burden. And because it will substantially increase the size of the crate, they will also need to be opt-in features. That in turn makes testing more annoying.

Moreover, I perceive these as very niche APIs. The needles likely need to be quite infrequent in the haystack for something like this to be worth it. Finally, it's not clear to me that it's worth adding this when one can just use aho-corasick. I'm guessing you found this to be faster, but I'm not clear on that.

@adamreichold (Author) commented Oct 13, 2025

> A memchr8 without memchr{4,5,6,7} is pretty odd.

Note that this is actually a memchr012345678, i.e. it supports any number of needles up to and including 8. Of course, naming things is hard. But I think this should mitigate a lot of the concerns about crate size, testing, etc.?
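
For concreteness, here is a rough sketch of the shape such a searcher takes, scalar reference only (names and exact signatures are illustrative, not the PR's actual code):

```rust
/// A sketch of a searcher for up to eight needles. Each 16-entry table maps
/// a nibble to a bitmask of the needles containing that nibble; a byte
/// matches iff some needle bit survives the AND of both lookups.
pub struct Eight {
    lo: [u8; 16],
    hi: [u8; 16],
}

impl Eight {
    /// Accepts anywhere from zero to eight needles, `None` beyond that.
    pub fn new(needles: &[u8]) -> Option<Eight> {
        if needles.len() > 8 {
            return None;
        }
        let (mut lo, mut hi) = ([0u8; 16], [0u8; 16]);
        for (i, &n) in needles.iter().enumerate() {
            lo[usize::from(n & 0xf)] |= 1u8 << i;
            hi[usize::from(n >> 4)] |= 1u8 << i;
        }
        Some(Eight { lo, hi })
    }

    /// Scalar reference implementation of the lookup.
    pub fn find(&self, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| {
            (self.lo[usize::from(b & 0xf)] & self.hi[usize::from(b >> 4)]) != 0
        })
    }
}
```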

> Finally, it's not clear to me that it's worth adding this when one can just use aho-corasick. I'm guessing you found this to be faster, but I'm not clear on that.

I did try aho-corasick over at robinson, and the additional machinery for supporting patterns longer than one byte did indeed have significant overhead. Admittedly, I did not directly compare them; it was just that aho-corasick did not improve upon the scalar implementation, whereas this code did.

> The needles likely need to be quite infrequent in the haystack for something like this to be worth it.

Could you expand on your reasoning here?

Considering that, according to Intel's intrinsics documentation, e.g. `_mm256_cmpeq_epi8` and `_mm256_shuffle_epi8` appear to have the same basic performance characteristics (1 cycle latency, 0.5 cycles-per-instruction throughput), I would actually wonder how this implementation of memchr8 compares to memchr3 when used with three needles.
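
The core of the comparison: memchr3 issues one cmpeq per needle, while the nibble-table approach pays a fixed price of a shift, two masks and two shuffles per chunk regardless of needle count. A minimal sketch of that classification step, assuming the `lo`/`hi` tables from above have been broadcast into both 128-bit lanes (again not the PR's actual code):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Returns a byte-wise mask that is non-zero wherever `chunk` contains any
/// of the (up to eight) needles encoded in the broadcast tables `lo`/`hi`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn classify(chunk: __m256i, lo: __m256i, hi: __m256i) -> __m256i {
    let nib_mask = _mm256_set1_epi8(0x0f);
    // Split each byte into its nibbles. The shift-and-mask for the high
    // nibble is overhead that a plain cmpeq against a needle does not pay.
    let lo_nibbles = _mm256_and_si256(chunk, nib_mask);
    let hi_nibbles = _mm256_and_si256(_mm256_srli_epi16::<4>(chunk), nib_mask);
    // One table lookup per nibble via shuffle, then AND the needle bitmasks:
    // a byte matches iff some needle agrees with it on both nibbles.
    let lo_hits = _mm256_shuffle_epi8(lo, lo_nibbles);
    let hi_hits = _mm256_shuffle_epi8(hi, hi_nibbles);
    _mm256_and_si256(lo_hits, hi_hits)
}
```

A caller would then compare the result against zero and movemask to locate the first hit, just as with the cmpeq-based variants.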

@adamreichold (Author)

> I would actually wonder how this implementation of memchr8 compares to memchr3 when used with three needles.

At least when the construction of the look-up tables is either constant-folded or amortised.

@BurntSushi (Owner)

> Could you expand on your reasoning here?

Processing more matches means more overhead. memchr has smaller overhead than memchr2. And so on. As you add more needles, the likelihood of matches generally increases. The regex crate makes a lot of heuristic assumptions around this for its literal optimizations.
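
Schematically (a hypothetical single-needle loop, not this crate's actual code): every hit falls out of the cheap 32-bytes-per-iteration path into the match-handling path, so the more frequent the needles, the less time is spent in the wide loop.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_avx2(haystack: &[u8], needle: u8) -> Option<usize> {
    let v = _mm256_set1_epi8(needle as i8);
    let mut i = 0;
    // Bulk path: one compare and one movemask per 32 bytes.
    while i + 32 <= haystack.len() {
        let chunk = _mm256_loadu_si256(haystack.as_ptr().add(i) as *const __m256i);
        let mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, v));
        if mask != 0 {
            // Match path: taken once per hit.
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32;
    }
    // Scalar tail for the final partial chunk.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```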

I did miss that this covers all needle variations. But I really just don't have the review bandwidth for something new like this. It needs to be considered holistically within the memchr API.

Moreover, I would definitely want to understand why this is better than aho-corasick's Teddy implementation. I think aho-corasick specializes for each needle length? So I would want to see if that could be made faster first.

> Considering that, according to Intel's intrinsics documentation, e.g. `_mm256_cmpeq_epi8` and `_mm256_shuffle_epi8` appear to have the same basic performance characteristics (1 cycle latency, 0.5 cycles-per-instruction throughput), I would actually wonder how this implementation of memchr8 compares to memchr3 when used with three needles.

Yeah that would be quite interesting! This crate uses rebar for benchmarks and I think you could probably bake them off within that framework.

@adamreichold (Author) commented Oct 13, 2025

> Yeah that would be quite interesting! This crate uses rebar for benchmarks and I think you could probably bake them off within that framework.

These are the results, where "eight" means three needles using the Eight instead of the Three implementation. I also directly called the AVX2 variants (in both cases) to enable constant-folding of the look-up tables for the slice-based variant.

```text
benchmark                        rust/memchr/memchr3  rust/memchr/memchr3/eight  rust/memchr/memchr3/fallback  rust/memchr/memchr3/naive
---------                        -------------------  -------------------------  ----------------------------  -------------------------
memchr/sherlock/common/huge3     629.6 MB/s (1.00x)   530.3 MB/s (1.19x)         208.6 MB/s (3.02x)            339.7 MB/s (1.85x)
memchr/sherlock/common/small3    666.6 MB/s (1.20x)   545.9 MB/s (1.46x)         695.1 MB/s (1.15x)            799.5 MB/s (1.00x)
memchr/sherlock/never/huge3      39.9 GB/s (1.00x)    31.6 GB/s (1.27x)          3.6 GB/s (11.23x)             1071.7 MB/s (38.17x)
memchr/sherlock/never/small3     12.4 GB/s (1.00x)    6.2 GB/s (2.00x)           2.9 GB/s (4.20x)              1018.1 MB/s (12.44x)
memchr/sherlock/never/tiny3      1605.0 MB/s (1.00x)  731.2 MB/s (2.20x)         1096.7 MB/s (1.46x)           731.2 MB/s (2.20x)
memchr/sherlock/never/empty3     40.00ns (1.00x)      60.00ns (1.50x)            40.00ns (1.00x)               40.00ns (1.00x)
memchr/sherlock/rare/huge3       26.8 GB/s (1.00x)    21.3 GB/s (1.26x)          3.3 GB/s (8.06x)              1060.9 MB/s (25.85x)
memchr/sherlock/rare/small3      8.7 GB/s (1.00x)     4.8 GB/s (1.83x)           2.7 GB/s (3.24x)              972.7 MB/s (9.17x)
memchr/sherlock/rare/tiny3       1096.7 MB/s (1.00x)  598.2 MB/s (1.83x)         940.1 MB/s (1.17x)            658.0 MB/s (1.67x)
memchr/sherlock/uncommon/huge3   2.0 GB/s (1.00x)     1944.3 MB/s (1.08x)        844.4 MB/s (2.49x)            770.2 MB/s (2.72x)
memchr/sherlock/uncommon/small3  2.6 GB/s (1.00x)     1972.7 MB/s (1.34x)        1809.3 MB/s (1.46x)           770.4 MB/s (3.43x)
memchr/sherlock/uncommon/tiny3   438.7 MB/s (1.50x)   299.1 MB/s (2.20x)         548.4 MB/s (1.20x)            658.0 MB/s (1.00x)
```

So my take would be that it is certainly slower, but it also comes close at times and usually lands in between the direct AVX2 and the fallback/SWAR versions.

The "empty" case is much slower though (40 versus 60 ns) which might still point at higher searcher construction cost. From counting intrinsics/instructions, I would guess the additional shifting and masking to get separate nibbles also adds extra costs, so it might look better against a hypothetical memchr4.

In any case,

> But I really just don't have the review bandwidth for something new like this.

so let's leave it at that. (I would have to maintain my bespoke version in any case, as I really need its "7+1" semantics - find any of `=/> \t\r\n` but also mark any `:` before that - which this API would not provide. Without doing both things in a single pass, I still end up slower than the scalar version for typical XML tag names.)
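
For illustration, a scalar single-pass version of those "7+1" semantics might look like this (the name and return shape are made up for the sketch):

```rust
/// Stop at the first of the seven delimiters while remembering the last `:`
/// seen before it, all in one pass over the haystack.
fn find_tag_name_end(haystack: &[u8]) -> (Option<usize>, Option<usize>) {
    let mut colon = None;
    for (i, &b) in haystack.iter().enumerate() {
        match b {
            b'=' | b'/' | b'>' | b' ' | b'\t' | b'\r' | b'\n' => {
                return (Some(i), colon);
            }
            b':' => colon = Some(i),
            _ => {}
        }
    }
    (None, colon)
}
```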

@BurntSushi (Owner)

Thanks for following up! That is an interesting data point.

I do loosely hope to expand memchr some day, but it will take time. And yeah, if you need a bespoke variant anyway, it's probably best to just stick with that.
