
2D blocking in GemmEx to improve speed #153

Merged
merged 1 commit into Mozilla-Ocho:main on Dec 30, 2023

Conversation

@ahgamut (Contributor) commented Dec 30, 2023

  • reading A and B directly from global memory in every thread is slow, so we allocate some shared memory for each block and copy into it first.
  • we use dynamic shared memory, so we specify the amount required when launching the kernel.
  • we don't use shared memory for C -- each thread calculates the values for a sub-row of C, i.e. a sub-matrix of size 1 x BN.
  • each block contains BM threads (enforced via assert), so each block calculates a BM x BN sub-matrix of C.
  • a BM x BN sub-matrix of C requires BM x k values from A and k x BN values from B. We can't hold all of that in shared memory, so we loop with a variable called blob and calculate the result step by step, each step using BM x BK values of A and BK x BN values of B.
  • since the BM threads in each block operate on nearby values of C (and therefore nearby values of A), we reduce the overall number of global memory accesses.

CUDA has a limit on shared memory -- the BM and BN values were picked based on how much CUDA would allow.
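
Below is a minimal sketch of the scheme described above, assuming row-major float matrices. The kernel and launcher names, the particular BM/BN/BK values, and the bounds handling are illustrative and are not taken from the tinyBLAS GemmEx source (which also has to deal with transpose flags and leading dimensions):

```cuda
// Hypothetical sketch of 2D blocking with dynamically sized shared memory.
#include <assert.h>
#include <cuda_runtime.h>

#define BM 64  // rows of the C tile computed by one block (one thread per row)
#define BN 32  // columns of the C tile
#define BK 8   // slice of the k dimension staged in shared memory per step

// C is m x n, A is m x k, B is k x n, all row-major.
__global__ void gemm_2d_block(int m, int n, int k,
                              const float *A, const float *B, float *C) {
    assert(blockDim.x == BM);  // each block must contain exactly BM threads

    // Dynamic shared memory: a BM x BK tile of A followed by a BK x BN tile of B.
    extern __shared__ float smem[];
    float *As = smem;            // BM * BK floats
    float *Bs = smem + BM * BK;  // BK * BN floats

    int row  = blockIdx.y * BM + threadIdx.x;  // this thread's row of C
    int col0 = blockIdx.x * BN;                // first column of this block's C tile

    float acc[BN] = {0.0f};  // each thread accumulates a 1 x BN sub-row of C

    // Walk the k dimension in chunks of BK ("blob" in the description above).
    for (int blob = 0; blob < k; blob += BK) {
        // Copy the current A and B tiles from global to shared memory.
        for (int j = 0; j < BK; ++j)
            As[threadIdx.x * BK + j] =
                (row < m && blob + j < k) ? A[row * k + blob + j] : 0.0f;
        for (int idx = threadIdx.x; idx < BK * BN; idx += blockDim.x) {
            int r = idx / BN, c = idx % BN;
            Bs[idx] = (blob + r < k && col0 + c < n) ? B[(blob + r) * n + col0 + c]
                                                     : 0.0f;
        }
        __syncthreads();

        // Multiply the staged tiles; every read here hits shared memory, not global.
        for (int j = 0; j < BK; ++j)
            for (int c = 0; c < BN; ++c)
                acc[c] += As[threadIdx.x * BK + j] * Bs[j * BN + c];
        __syncthreads();
    }

    // No shared memory for C: write the 1 x BN sub-row straight to global memory.
    if (row < m)
        for (int c = 0; c < BN; ++c)
            if (col0 + c < n) C[row * n + col0 + c] = acc[c];
}

void launch_gemm(int m, int n, int k, const float *A, const float *B, float *C) {
    dim3 grid((n + BN - 1) / BN, (m + BM - 1) / BM);
    dim3 block(BM);  // BM threads per block

    // Dynamic shared memory: specify the tile footprint at launch time and make
    // sure it fits under the device's per-block shared memory limit.
    size_t smem = (BM * BK + BK * BN) * sizeof(float);
    int maxSmem = 0;
    cudaDeviceGetAttribute(&maxSmem, cudaDevAttrMaxSharedMemoryPerBlock, 0);
    assert(smem <= (size_t)maxSmem);

    gemm_2d_block<<<grid, block, smem>>>(m, n, k, A, B, C);
}
```

The two __syncthreads() calls bracket each blob step so a tile is fully staged before any thread reads it and fully consumed before it is overwritten by the next step.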

@ahgamut (Contributor, Author) commented Dec 30, 2023

Marking this as a draft PR for now: we should confirm the output is the same as before under temp = 0, and I think this same trick can also be extended to the batched calls.

@ahgamut (Contributor, Author) commented Dec 30, 2023

Reference (I'm not doing the thread-local cache stuff): https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/5_kernel_2D_blocktiling.cuh

@jart (Collaborator) commented Dec 30, 2023

Outstanding. I'll wait for the PR to go green before reviewing in more detail, but it looks promising so far. This should hopefully allow us to fix #142, and it may help us solve #104 as well.

@ahgamut ahgamut marked this pull request as ready for review December 30, 2023 14:38
@ahgamut (Contributor, Author) commented Dec 30, 2023

I tried updating the batched calls to also do the matmuls like this, but it was not stable. If we can confirm the outputs are the same before/after this PR, it can be merged.

@jart jart self-requested a review December 30, 2023 14:40
@jart (Collaborator) left a comment


Nice. I've confirmed that, on Windows, with a fat prebuilt GGML-CUDA + tinyBLAS DLL, LLaVA image processing goes 50% as fast as using cuBLAS. That's great news for people on Windows who want to use llamafile with their GPU without needing to go to the trouble of installing CUDA and MSVC by hand.


Thanks for your contribution. Any further performance improvements you can send our way would be very welcome.

@jart merged commit 1d9fa85 into Mozilla-Ocho:main on Dec 30, 2023
1 check passed
@jart (Collaborator) commented Dec 30, 2023

This performance improvement has been rolled out to Hugging Face. You can download the latest prebuilt LLaVA binaries here:

https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main

On my Windows computer, the llamafile server in tinyBLAS mode is now able to produce responses to image uploads in a few seconds. It's very usable, even on my modest four-year-old NVIDIA graphics card. Text generation with tinyBLAS is still as good as always, going 50 tokens/sec for me.
