Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't crash on Linux machines with L4 cache #28

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dfyz
Copy link

@dfyz dfyz commented May 27, 2024

I recently found out that this is a thing when trying to run a candle program (which depends on gemm) on this machine:

# grep 'model name' /proc/cpuinfo
model name      : Intel(R) Core(TM) i7-4770R CPU @ 3.20GHz
...
# cat /sys/devices/system/cpu/cpu*/cache/index4/level
4
...
# lscpu -B -C=type,level,ways,coherency-size,one-size
TYPE        LEVEL WAYS COHERENCY-SIZE  ONE-SIZE
Data            1    8             64     32768
Instruction     1    8             64     32768
Unified         2    8             64    262144
Unified         3   12             64   6291456
Unified         4   16             64 134217728

The Linux-specific code path that probes cache sizes via lscpu and sysfs assumes that level can't be greater than 3, so without this PR anything using gemm crashes like this:

index out of bounds: the len is 3 but the index is 3

This PR fixes this by adding a guard identical to the one existing in the generic X86 cache size probing code.

(an interesting theoretical question is whether it is possible to somehow exploit this gigantic 128 MiB cache instead of ignoring it)

@sarah-quinones
Copy link
Owner

an alternative approach that would make use of the cache is doing something like let level = Ord::min(level, 3)

@dfyz
Copy link
Author

dfyz commented May 29, 2024

an alternative approach that would make use of the cache is doing something like let level = Ord::min(level, 3)

I just tried that, but it appears to be trickier than I thought at first:

  • fs::read_dir() doesn't guarantee any particular order, and on my system, .../level4 comes before .../level3. Since ties are currently resolved by the cache line size (and they are are all the same on my machine), the L3 cache "wins" and overwrites the last slot in the cache hierarchy.
  • When I worked around that by temporarily falling back to the lscpu code path (which preserves ordering), I noticed no performance improvement on a large matmul (more specifically, 8K×8K×8K f32 NN GEMM). Either I did something wrong, or the large macropanel size somehow doesn't help (I double-checked that it increased from 2736 to 8196).

Perhaps it makes sense to merge the fix for the crashes first, and then think of exploiting the L4 cache. By the way, I also added an additional commit that prevents the lscpu code path from crashing (my bad, I completely forgot about it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants