I have a NAS with a Xeon Gold 6132, 192 GB of RAM, and a GeForce RTX 2080 Ti GPU with 11 GB of VRAM.
I want to use models that don't fit in my GPU memory. If I load larger models, loading fails, I assume because LocalAI is trying to fit the whole model into GPU VRAM.
I get errors like this.
Jan 14 19:19:48 INFO BackendLoader starting modelID="mlabonne_gemma-3-27b-it-abliterated" backend="llama-cpp" model="mlabonne_gemma-3-27b-it-abliterated-Q4_K_M.gguf"
Jan 14 19:21:13 ERROR Failed to load model modelID="mlabonne_gemma-3-27b-it-abliterated" error=failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = backend="llama-cpp"
Jan 14 19:21:13 ERROR Stream ended with error error=failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
Is there a way to utilize both the GPU and CPU/system memory for an LLM instead of only VRAM, or is it strictly one or the other?
Also, how do I tell LocalAI to load a model on the CPU instead of the GPU?
At the very least, I thought it would be smart enough not to load an LLM onto the GPU when it doesn't fit in GPU memory.
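For context, this is roughly what I was imagining for the model's YAML config. I'm not sure `gpu_layers` is the right knob for splitting a model between VRAM and system RAM, so treat the values below as guesses rather than a verified setup:

```yaml
# mlabonne_gemma-3-27b-it-abliterated.yaml -- guesswork, not a verified config
name: gemma-3-27b-it-abliterated
backend: llama-cpp
parameters:
  model: mlabonne_gemma-3-27b-it-abliterated-Q4_K_M.gguf
context_size: 4096
f16: true
# My hope: offload only as many layers as fit in the 11 GB of VRAM
# and keep the remaining layers in system RAM.
gpu_layers: 20
threads: 14
```

If that is how partial offloading works, then presumably setting `gpu_layers: 0` would force a CPU-only load, but I haven't found this spelled out anywhere.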