
2D blocking in GemmEx to improve speed #153

Merged
merged 1 commit into Mozilla-Ocho:main on Dec 30, 2023

Conversation

@ahgamut (Contributor) commented Dec 30, 2023

  • reading A and B directly from global memory in every thread is slow, so we allocate some shared memory for each block and copy into it first.
  • we use dynamic shared memory, so we specify the amount required when launching the kernel.
  • we don't use shared memory for C -- each thread calculates the values for a sub-row of C, i.e. a sub-matrix of size 1 x BN.
  • each block contains BM threads (enforced via assert), so each block calculates a BM x BN sub-matrix of C.
  • a BM x BN sub-matrix of C requires BM x k values from A and k x BN values from B. We can't hold all of that in shared memory, so we loop with a variable called blob and calculate the result step by step, each step using BM x BK values of A and BK x BN values of B.
  • since the BM threads in each block operate on nearby values of C (and therefore nearby values of A), we reduce the overall number of global memory accesses.

CUDA has a limit on shared memory -- the BM and BN values were picked based on how much CUDA would allow.
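
Below is a minimal sketch of the scheme described above, assuming row-major float matrices. The kernel and launcher names, the particular BM/BN/BK values, and the bounds handling are illustrative and are not taken from the tinyBLAS GemmEx source (which also has to deal with transpose flags and leading dimensions):

```cuda
// Hypothetical sketch of 2D blocking with dynamically sized shared memory.
#include <assert.h>
#include <cuda_runtime.h>

#define BM 64  // rows of the C tile computed by one block (one thread per row)
#define BN 32  // columns of the C tile
#define BK 8   // slice of the k dimension staged in shared memory per step

// C is m x n, A is m x k, B is k x n, all row-major.
__global__ void gemm_2d_block(int m, int n, int k,
                              const float *A, const float *B, float *C) {
    assert(blockDim.x == BM);  // each block must contain exactly BM threads

    // Dynamic shared memory: a BM x BK tile of A followed by a BK x BN tile of B.
    extern __shared__ float smem[];
    float *As = smem;            // BM * BK floats
    float *Bs = smem + BM * BK;  // BK * BN floats

    int row  = blockIdx.y * BM + threadIdx.x;  // this thread's row of C
    int col0 = blockIdx.x * BN;                // first column of this block's C tile

    float acc[BN] = {0.0f};  // each thread accumulates a 1 x BN sub-row of C

    // Walk the k dimension in chunks of BK ("blob" in the description above).
    for (int blob = 0; blob < k; blob += BK) {
        // Copy the current A and B tiles from global to shared memory.
        for (int j = 0; j < BK; ++j)
            As[threadIdx.x * BK + j] =
                (row < m && blob + j < k) ? A[row * k + blob + j] : 0.0f;
        for (int idx = threadIdx.x; idx < BK * BN; idx += blockDim.x) {
            int r = idx / BN, c = idx % BN;
            Bs[idx] = (blob + r < k && col0 + c < n) ? B[(blob + r) * n + col0 + c]
                                                     : 0.0f;
        }
        __syncthreads();

        // Multiply the staged tiles; every read here hits shared memory, not global.
        for (int j = 0; j < BK; ++j)
            for (int c = 0; c < BN; ++c)
                acc[c] += As[threadIdx.x * BK + j] * Bs[j * BN + c];
        __syncthreads();
    }

    // No shared memory for C: write the 1 x BN sub-row straight to global memory.
    if (row < m)
        for (int c = 0; c < BN; ++c)
            if (col0 + c < n) C[row * n + col0 + c] = acc[c];
}

void launch_gemm(int m, int n, int k, const float *A, const float *B, float *C) {
    dim3 grid((n + BN - 1) / BN, (m + BM - 1) / BM);
    dim3 block(BM);  // BM threads per block

    // Dynamic shared memory: specify the tile footprint at launch time and make
    // sure it fits under the device's per-block shared memory limit.
    size_t smem = (BM * BK + BK * BN) * sizeof(float);
    int maxSmem = 0;
    cudaDeviceGetAttribute(&maxSmem, cudaDevAttrMaxSharedMemoryPerBlock, 0);
    assert(smem <= (size_t)maxSmem);

    gemm_2d_block<<<grid, block, smem>>>(m, n, k, A, B, C);
}
```

The two __syncthreads() calls bracket each blob step so a tile is fully staged before any thread reads it and fully consumed before it is overwritten by the next step.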

@ahgamut (Contributor, Author) commented Dec 30, 2023

Marking this as a draft PR for now: we should confirm the output is the same as before under temp = 0, and I think this same trick can also be extended to the batched calls.

@ahgamut (Contributor, Author) commented Dec 30, 2023

Reference (I'm not doing the thread-local cache stuff): https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/5_kernel_2D_blocktiling.cuh

@jart (Collaborator) commented Dec 30, 2023

Outstanding. I'll wait for the PR to go green before reviewing in more detail, but it looks promising so far. This should hopefully allow us to fix #142, and it may help us solve #104 as well.

@ahgamut ahgamut marked this pull request as ready for review December 30, 2023 14:38
@ahgamut (Contributor, Author) commented Dec 30, 2023

I tried updating the batched calls to also do the matmuls like this, but it was not stable. If we can confirm the outputs are the same before/after this PR, it can be merged.

@jart jart self-requested a review December 30, 2023 14:40
@jart (Collaborator) left a comment


Nice. I've confirmed that, on Windows, with a fat prebuilt GGML-CUDA + tinyBLAS DLL, LLaVA image processing goes 50% as fast as using cuBLAS. That's great news for people on Windows who want to use llamafile with their GPU without needing to go to the trouble of installing CUDA and MSVC by hand.


Thanks for your contribution. Any further performance improvements you can send our way would be very welcome.

@jart merged commit 1d9fa85 into Mozilla-Ocho:main on Dec 30, 2023
1 check passed
@jart (Collaborator) commented Dec 30, 2023

This performance improvement has been rolled out to Hugging Face. You can download the latest prebuilt LLaVA binaries here:

https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main

On my Windows computer, the llamafile server in tinyBLAS mode is now able to produce responses to image uploads in a few seconds. It's very usable, even on my modest four-year-old NVIDIA graphics card. Text generation with tinyBLAS is still as good as always, going 50 tokens/sec for me.
