I have a NAS with a Xeon Gold 6132, 192 GB of RAM, and a GeForce RTX 2080 Ti GPU with 11 GB of VRAM.
I want to use models that don't fit in my GPU memory. If I load larger models, loading fails, I assume because LocalAI is trying to fit the whole model into GPU VRAM.
I get errors like this.
Jan 14 19:19:48 INFO BackendLoader starting modelID="mlabonne_gemma-3-27b-it-abliterated" backend="llama-cpp" model="mlabonne_gemma-3-27b-it-abliterated-Q4_K_M.gguf"
Jan 14 19:21:13 ERROR Failed to load model modelID="mlabonne_gemma-3-27b-it-abliterated" error=failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = backend="llama-cpp"
Jan 14 19:21:13 ERROR Stream ended with error error=failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
Is there a way to utilize both the GPU and CPU/system memory for an LLM instead of only VRAM, or is it strictly one or the other?
Also, how do I tell LocalAI to load a model on the CPU instead of the GPU?
At the very least, I thought it would be smart enough not to load an LLM onto the GPU when it doesn't fit in GPU memory.
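For context, this is roughly what I was imagining for the model's YAML config. I'm not sure `gpu_layers` is the right knob for splitting a model between VRAM and system RAM, so treat the values below as guesses rather than a verified setup:

```yaml
# mlabonne_gemma-3-27b-it-abliterated.yaml -- guesswork, not a verified config
name: gemma-3-27b-it-abliterated
backend: llama-cpp
parameters:
  model: mlabonne_gemma-3-27b-it-abliterated-Q4_K_M.gguf
context_size: 4096
f16: true
# My hope: offload only as many layers as fit in the 11 GB of VRAM
# and keep the remaining layers in system RAM.
gpu_layers: 20
threads: 14
```

If that is how partial offloading works, then presumably setting `gpu_layers: 0` would force a CPU-only load, but I haven't found this spelled out anywhere.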