2D blocking in GemmEx to improve speed #153
Conversation
- having each thread read A and B via global memory is slow, so we allocate some shared memory for each block and copy to that first.
- we use dynamic shared memory, so we specify the amount required when launching the kernel.
- we don't use shared memory for C -- each thread calculates the values for a sub-row of C, i.e. a sub-matrix of size 1 x BN.
- each block contains BM threads (enforced via assert), so each block calculates a sub-matrix of C of size BM x BN.
- a BM x BN sub-matrix of C requires BM x k values from A and k x BN values from B. we can't hold all of it in shared memory, so we loop with a variable called blob and calculate step by step, each step using BM x BK values of A and BK x BN values of B.
- since the BM threads in each block operate on nearby values of C (and therefore nearby values of A), we reduce the overall number of memory accesses.
- CUDA has a limit on shared memory -- the BM and BN values were picked based on how much CUDA would allow. (a sketch of this blocking scheme is shown below.)
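Here is a minimal sketch of the blocking scheme described above, assuming row-major float matrices and illustrative tile sizes (BM=64, BN=32, BK=8); the kernel name `gemm_blocked` and the exact load logic are hypothetical and simplified relative to the PR's actual GemmEx integration:

```cuda
#include <assert.h>
#include <cuda_runtime.h>

#define BM 64  // rows of the C tile = threads per block (illustrative value)
#define BN 32  // columns of the C tile
#define BK 8   // K-slice handled per iteration of the "blob" loop

// C is m x n, A is m x k, B is k x n, all row-major (assumption for this sketch).
__global__ void gemm_blocked(int m, int n, int k,
                             const float *A, const float *B, float *C) {
    // dynamic shared memory: a BM x BK tile of A followed by a BK x BN tile of B
    extern __shared__ float smem[];
    float *As = smem;            // BM x BK
    float *Bs = smem + BM * BK;  // BK x BN

    assert(blockDim.x == BM);                 // each block is exactly BM threads
    int row = blockIdx.y * BM + threadIdx.x;  // the row of C this thread owns
    int colBase = blockIdx.x * BN;            // first column of this block's tile

    float acc[BN] = {0.0f};  // the 1 x BN sub-row of C, accumulated in registers

    // walk across K in steps of BK, staging tiles of A and B in shared memory
    for (int blob = 0; blob < k; blob += BK) {
        for (int i = 0; i < BK; ++i)          // each thread copies its row of the A tile
            As[threadIdx.x * BK + i] =
                (row < m && blob + i < k) ? A[row * k + blob + i] : 0.0f;
        for (int i = threadIdx.x; i < BK * BN; i += BM) {  // cooperative copy of the B tile
            int kk = i / BN, nn = i % BN;
            Bs[i] = (blob + kk < k && colBase + nn < n)
                        ? B[(blob + kk) * n + colBase + nn] : 0.0f;
        }
        __syncthreads();

        // multiply the staged BM x BK and BK x BN tiles out of shared memory
        for (int kk = 0; kk < BK; ++kk)
            for (int nn = 0; nn < BN; ++nn)
                acc[nn] += As[threadIdx.x * BK + kk] * Bs[kk * BN + nn];
        __syncthreads();  // wait before the next iteration overwrites the tiles
    }

    if (row < m)
        for (int nn = 0; nn < BN; ++nn)
            if (colBase + nn < n)
                C[row * n + colBase + nn] = acc[nn];
}
```

A launch would then request the dynamic shared memory explicitly, which is the "specify the amount required when launching the kernel" step:

```cuda
// one block per BM x BN tile of C, BM threads per block
// dA, dB, dC are device pointers (hypothetical names)
dim3 grid((n + BN - 1) / BN, (m + BM - 1) / BM);
size_t smem = (BM * BK + BK * BN) * sizeof(float);
gemm_blocked<<<grid, BM, smem>>>(m, n, k, dA, dB, dC);
```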
Draft PR because -- confirm output is same as before under
Reference (I'm not doing the thread-local cache stuff): https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/5_kernel_2D_blocktiling.cuh
I tried updating the batched calls to also do the matmuls like this, but it was not stable. If we can confirm the outputs are the same before/after this PR, it can be merged.
Nice. I've confirmed that, on Windows, with a fat prebuilt GGML-CUDA + tinyBLAS DLL, LLaVA image processing goes 50% as fast as using cuBLAS. That's great news for people on Windows who want to use llamafile with their GPU without needing to go to the trouble of installing CUDA and MSVC by hand.
Thanks for your contribution. Any further performance improvements you can send our way would be very welcome.
This performance improvement has been rolled out to Hugging Face. You can download the latest prebuilt LLaVA binaries here: https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main

On my Windows computer, the llamafile server in tinyBLAS mode is now able to produce responses to image uploads in a few seconds. It's very usable, even on my modest four-year-old NVIDIA graphics card. Text generation with tinyBLAS is still as good as always, going 50 tokens/sec for me.