Improve hc_matchfinder_longest_match() performance on Apple Silicon #284

andrews05 · 2022-12-31T00:51:35Z

I was comparing performance on two MacBooks and was surprised at some of the results.
I used the included benchmark program to compare the following hardware:
2015 MacBook Pro: 2.2 GHz Quad-Core Intel Core i7
2021 MacBook Pro: 8-core Apple M1 Pro

Level	Intel	M1 Pro	Difference
1	70ms	43ms	63%
2	107ms	59ms	81%
3	103ms	60ms	72%
4	108ms	61ms	77%
5	112ms	64ms	75%
6	120ms	74ms	62%
7	157ms	110ms	43%
8	316ms	270ms	17%
9	443ms	429ms	3%
10	2062ms	1340ms	54%
11	4481ms	3142ms	43%
12	9192ms	7309ms	26%

At level 9, M1 Pro is only 3% faster than a 6-year older Intel!
I tried profiling it with xctrace and as best I can tell the performance hit comes from load_u32_unaligned (I can attach the trace output if that would be helpful). I can confirm that UNALIGNED_ACCESS_IS_FAST is set, but beyond that I haven't been able to work out why there's an issue.
Do you have any ideas, or is it really the case that the Intel hardware is simply better at this?

The text was updated successfully, but these errors were encountered:

ebiggers · 2022-12-31T03:11:35Z

At level 9 the bottleneck is usually the inner loop in hc_matchfinder_longest_match(). This is one very specific thing, so it is plausible that a CPU that is faster in general is not faster at this specific thing. This inner loop is also not very friendly to CPUs in general, as it doesn't allow for much instruction pipelining. But it's hard to improve.

andrews05 · 2022-12-31T03:51:05Z

Yeah, it is hc_matchfinder_longest_match, but load_u32_unaligned seems to be taking around 1/3rd of the time within that function on the M1.

Intel:

M1:

ebiggers · 2022-12-31T04:03:40Z

It doesn't really make sense to talk about just the load_u32_unaligned part of hc_matchfinder_longest_match, since load_u32_unaligned is always inlined and compiles down to a single load. The fact that you are seeing a difference in the time accounted to load_u32_unaligned is probably just a side effect of whatever profiling tool you are using.

OlliV · 2024-10-28T10:18:02Z

I ran a test and can can confirm that having UNALIGNED_ACCESS_IS_FAST set is definitely faster than having it set to 0. xctrace actually marks the function as inlined. I'm not sure why a debug symbol is even produced for this function, because my experience is that usually it wouldn't happen.

ebiggers added the enhancement label Dec 31, 2022

ebiggers changed the title ~~Comparatively poor performance on Apple Silicon~~ Comparatively poor compression performance on Apple Silicon Jan 10, 2023

ebiggers changed the title ~~Comparatively poor compression performance on Apple Silicon~~ Improve hc_matchfinder_longest_match() performance on Apple Silicon Jan 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve hc_matchfinder_longest_match() performance on Apple Silicon #284

Improve hc_matchfinder_longest_match() performance on Apple Silicon #284

andrews05 commented Dec 31, 2022 •

edited

Loading

ebiggers commented Dec 31, 2022

andrews05 commented Dec 31, 2022

ebiggers commented Dec 31, 2022

OlliV commented Oct 28, 2024

Improve hc_matchfinder_longest_match() performance on Apple Silicon #284

Improve hc_matchfinder_longest_match() performance on Apple Silicon #284

Comments

andrews05 commented Dec 31, 2022 • edited Loading

ebiggers commented Dec 31, 2022

andrews05 commented Dec 31, 2022

ebiggers commented Dec 31, 2022

OlliV commented Oct 28, 2024

andrews05 commented Dec 31, 2022 •

edited

Loading