[Feat]: Support for GPU Acceleration on Newer Qualcomm Devices #196

Open
shadow3aaa opened this issue Feb 2, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

shadow3aaa commented Feb 2, 2025

Description:
I would like to request the addition of GPU acceleration support for Qualcomm devices in the custom llama.cpp build used by this project, via the new OpenCL backend. This enhancement could significantly improve the performance of the PocketPal AI application on devices with Qualcomm Adreno GPUs.

Reference:
For more details, please refer to the Qualcomm developer blog post: Introducing the new OpenCL GPU backend for llama.cpp
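
For context, a minimal cross-compilation sketch of what enabling that backend could look like, based on the upstream llama.cpp OpenCL backend documentation; the option names, NDK paths, and API level below are assumptions and would need to be checked against this project's actual build setup:

  # Sketch: cross-compile llama-cli for Android with the OpenCL (Adreno) backend.
  # Assumes $ANDROID_NDK points at an installed NDK and that the Android OpenCL
  # headers/ICD loader have already been made available to the toolchain.
  cmake -B build-android \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-28 \
    -DGGML_OPENCL=ON \
    -DBUILD_SHARED_LIBS=OFF
  cmake --build build-android --config Release -j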

Benefits:

  • Enhanced performance on Qualcomm devices
  • Better utilization of device hardware capabilities
  • Potential for more responsive and efficient AI processing on mobile devices

Thank you for considering this feature request.

shadow3aaa added the enhancement (New feature or request) label on Feb 2, 2025
shadow3aaa (Author) commented:

My test results (tested on the Snapdragon 8 Elite platform):

  • GPU Inference
OP5D0DL1:/data/local/tmp $ ./bin/llama-cli -m ggml-model-qwen1.5-7b-chat-Q4_0.gguf -b 128 -ngl 99 -c 2048 -p "Hello"

llama_perf_sampler_print:    sampling time =      24.03 ms /    46 runs   (    0.52 ms per token,  1914.51 tokens per second)
llama_perf_context_print:        load time =   11765.90 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    5038.54 ms /    45 runs   (  111.97 ms per token,     8.93 tokens per second)
llama_perf_context_print:       total time =    5120.83 ms /    46 tokens
  • CPU Inference
llama_perf_sampler_print:    sampling time =       1.34 ms /    13 runs   (    0.10 ms per token,  9708.74 tokens per second)
llama_perf_context_print:        load time =    4218.12 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    6745.14 ms /    12 runs   (  562.10 ms per token,     1.78 tokens per second)
llama_perf_context_print:       total time =    8294.89 ms /    13 tokens

GPU-accelerated llama.cpp inference is approximately 5 times faster than CPU inference on my device (8.93 vs. 1.78 tokens per second in the eval phase).
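
For anyone wanting to reproduce a test like this, a rough adb sequence is sketched below, assuming a static llama-cli binary built as in the earlier sketch and the same model file; the paths are illustrative, not taken from the original test:

  # Push the cross-compiled binary and the model to the device
  adb shell mkdir -p /data/local/tmp/bin
  adb push build-android/bin/llama-cli /data/local/tmp/bin/
  adb push ggml-model-qwen1.5-7b-chat-Q4_0.gguf /data/local/tmp/
  adb shell chmod +x /data/local/tmp/bin/llama-cli

  # Run with all layers offloaded to the Adreno GPU (-ngl 99), as in the GPU test above
  adb shell "cd /data/local/tmp && ./bin/llama-cli -m ggml-model-qwen1.5-7b-chat-Q4_0.gguf -b 128 -ngl 99 -c 2048 -p 'Hello'"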
