
Conversation

@ashvardanian
Contributor

Apple chips provide several functional units capable of high-throughput matrix multiplication and AI inference. Those compute units include the CPU, the GPU, and the Neural Engine, and Core ML exposes the choice through its `computeUnits` setting. For maximum compatibility, the `.all` option is used by default. Sadly, Apple's scheduler is not always optimal, so it can be beneficial to specify the target device explicitly, especially when the models are pre-compiled for the Apple Neural Engine, as that may yield significant performance gains.
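The device can be pinned at model-load time through Core ML's `MLModelConfiguration`. A minimal Swift sketch, assuming a compiled `.mlmodelc` bundle at some `modelURL`:

```swift
import CoreML

// Pin execution to the Neural Engine (Core ML keeps the CPU as a fallback)
// instead of relying on the default `.all` scheduling.
let configuration = MLModelConfiguration()
configuration.computeUnits = .cpuAndNeuralEngine // alternatives: .all, .cpuOnly, .cpuAndGPU
let model = try MLModel(contentsOf: modelURL, configuration: configuration)
```

With `.cpuAndNeuralEngine`, Core ML may still fall back to the CPU for layers the ANE cannot run, which is why quantized, ANE-friendly models benefit most from this setting.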

| Model | Text Encoder (GPU) | Text Encoder (ANE) | Image Encoder (GPU) | Image Encoder (ANE) |
| :--- | ---: | ---: | ---: | ---: |
| english-small | 2.53 ms | 0.53 ms | 6.57 ms | 1.23 ms |
| english-base | 2.54 ms | 0.61 ms | 18.90 ms | 3.79 ms |
| english-large | 2.30 ms | 0.61 ms | 79.68 ms | 20.94 ms |
| multilingual-base | 2.34 ms | 0.50 ms | 18.98 ms | 3.77 ms |

Measured on an Apple M4 iPad running iOS 18.2, with batch size 1 and the model pre-loaded into memory. The original encoders use f32 single-precision numbers for maximum compatibility and mostly rely on the GPU for computation. The quantized encoders use a mixture of i8, f16, and f32 numbers for maximum performance and mostly rely on the Apple Neural Engine (ANE). The median latency is reported.
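For context, a sketch of how such a median latency could be measured, assuming `model` and `input` are already prepared; the function name and run count are illustrative, not part of the benchmark code:

```swift
import CoreML
import Dispatch

// Report the median prediction latency in milliseconds over `runs` calls.
// One warm-up call is made first, so model-load and first-run compilation
// costs are excluded, matching a "pre-loaded into memory" setup.
func medianLatencyMs(of model: MLModel, input: MLFeatureProvider, runs: Int = 100) throws -> Double {
    _ = try model.prediction(from: input) // warm-up
    var timings: [Double] = []
    for _ in 0..<runs {
        let start = DispatchTime.now()
        _ = try model.prediction(from: input)
        let elapsed = DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds
        timings.append(Double(elapsed) / 1_000_000) // ns → ms
    }
    return timings.sorted()[runs / 2] // median (upper value for even counts)
}
```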

Co-authored-by: Kirill Solodskikh <[email protected]>
Co-authored-by: Azim Kurbanov <[email protected]>
Co-authored-by: Ruslan Aydarkhanov <[email protected]>
Co-authored-by: Andrey Ageev <[email protected]>
ashvardanian merged commit 2dbcc42 into main on Dec 20, 2024
4 checks passed