A local, GPU-efficient multimodal chatbot built around MiniCPM-o-4.5, supporting text, image understanding, voice input, offline text-to-speech, and optional image generation — all runnable on a single 12GB GPU.
This project focuses on practical multimodal interaction while keeping memory usage under control through careful model loading, CPU/GPU separation, and optional components.
- No external APIs required (fully local)
- Text chat with MiniCPM-o-4.5
- Speech-to-text (offline, CPU-based Whisper)
- Text-to-speech (offline Piper TTS)
- Audio reversal & volume control
- Image generation with SDXL normal, turboXL, or turbo
- Image-to-image (img2img) pipelines
- Image understanding (vision + language)
- 12GB-GPU friendly design
- MiniCPM-o-4.5 (INT4)
  - Runs on GPU via device_map="auto"
- Speech-to-Text
  - faster-whisper on CPU (keeps GPU free)
- Text-to-Speech
  - Piper (offline ONNX voice model)
- Image Generation
  - Stable Diffusion Normal, TurboXL, or Turbo
  - Loaded on CPU and moved to GPU only during inference
- UI
  - Gradio Blocks interface
This separation allows smooth interaction even on consumer GPUs.
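The "loaded on CPU and moved to GPU only during inference" pattern can be sketched as a small context manager. This is a minimal illustration, not the repository's actual code: on_device is a hypothetical helper name, and it assumes only that the model or pipeline object exposes a .to(device) method, as PyTorch modules and Diffusers pipelines do.

```python
from contextlib import contextmanager

try:  # torch is optional here so the sketch also runs without it
    import torch
except ImportError:
    torch = None


@contextmanager
def on_device(model, device="cuda"):
    """Temporarily move `model` to `device`, then return it to the CPU.

    `model` is any object with a .to(device) method (hypothetical helper,
    not taken from the repository).
    """
    model.to(device)
    try:
        yield model
    finally:
        # Always move back, even if inference raised an exception.
        model.to("cpu")
        # Release cached VRAM so the chat model has room again.
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()


# Example use with a Diffusers pipeline named `pipe` (assumed to exist):
# with on_device(pipe) as p:
#     image = p(prompt="a lighthouse at dusk").images[0]
```

The trade-off is a transfer delay on every generation in exchange for leaving VRAM free for the chat model between requests.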
Hardware
- NVIDIA GPU with ~12GB VRAM (tested)
- CPU capable of running Whisper + Piper
Software
- Python 3.10+
- CUDA-enabled PyTorch
- Linux recommended (WSL2 works well)
Note: If the install command below hangs, press CTRL+C and rerun it.
wsl --install -d Ubuntu-24.04 --name AI-Sandbox
When prompted, create a username and password for WSL.
# 0. Create folder and clone repo
cd ~
git clone https://github.com/amill288/AI-Local-Sandbox.git
cd AI-Local-Sandbox
# 1. Create venv
sudo apt update
sudo apt install -y python3-venv
python3 -m venv .venv
source .venv/bin/activate
# 2. Install PyTorch first (CUDA-specific)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
sudo apt install -y ffmpeg
# 3. Install the rest
pip install -r requirements.txt
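Once the install steps above finish, a quick check like the following can confirm that the environment matches the stated requirements (Python 3.10+, ffmpeg, CUDA-enabled PyTorch). It is a hedged sketch: check_environment and vram_gib are hypothetical helper names that are not part of the repository, and the script only reports what it finds.

```python
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed in this environment")
    else:
        if not torch.cuda.is_available():
            problems.append("PyTorch is installed but cannot see a CUDA device")
    return problems


def vram_gib():
    """Total memory of GPU 0 in GiB, or None when no CUDA device is usable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
    print("GPU memory (GiB):", vram_gib())
```

Run it inside the activated .venv so that it inspects the same interpreter and packages the apps will use.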
You’ll also need:
- A Piper voice model (ONNX)
- piper available on your PATH
mkdir -p ~/piper_voices/libritts_r_medium
cd ~/piper_voices/libritts_r_medium
wget -O en_US-libritts_r-medium.onnx \
https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/libritts_r/medium/en_US-libritts_r-medium.onnx
wget -O en_US-libritts_r-medium.onnx.json \
https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/libritts_r/medium/en_US-libritts_r-medium.onnx.json
cd ~/AI-Local-Sandbox
To use a different Piper voice, change the two wget URLs (and the matching -O filenames) so they point at that voice's .onnx and .onnx.json files.
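To verify the voice setup above, a short sketch like this can check for the piper executable and the two downloaded files, then synthesize a test sentence. It assumes the directory and filenames used in the wget commands; missing_piper_pieces, piper_command, and speak are hypothetical helper names, and the repository's own scripts may call Piper differently. The --model and --output_file flags are the Piper CLI's, which reads the text to speak from standard input.

```python
import shutil
import subprocess
from pathlib import Path

# Paths taken from the download commands above.
VOICE_DIR = Path.home() / "piper_voices" / "libritts_r_medium"
VOICE_NAME = "en_US-libritts_r-medium.onnx"


def missing_piper_pieces(voice_dir=VOICE_DIR, voice_name=VOICE_NAME):
    """List what is still missing before Piper can synthesize speech."""
    missing = []
    if shutil.which("piper") is None:
        missing.append("piper executable on PATH")
    # Piper needs both the ONNX model and its JSON config next to it.
    for name in (voice_name, voice_name + ".json"):
        if not (Path(voice_dir) / name).is_file():
            missing.append(name)
    return missing


def piper_command(model_path, wav_path):
    """Build the CLI call; Piper reads the text to speak from stdin."""
    return ["piper", "--model", str(model_path), "--output_file", str(wav_path)]


def speak(text, wav_path="reply.wav", voice_dir=VOICE_DIR, voice_name=VOICE_NAME):
    """Synthesize `text` to a WAV file and return the output path."""
    cmd = piper_command(Path(voice_dir) / voice_name, wav_path)
    subprocess.run(cmd, input=text.encode("utf-8"), check=True)
    return wav_path


if __name__ == "__main__":
    problems = missing_piper_pieces()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Wrote", speak("Piper is ready."))
```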
Browser Address (use localhost, not 0.0.0.0)
http://localhost:7860
cd ~/AI-Local-Sandbox
python minicpm.py
cd ~/AI-Local-Sandbox
python webcam_live_gradio.py
cd ~/AI-Local-Sandbox
python sdxl_safe.py
cd ~/AI-Local-Sandbox
python sdxl.py
The first run automatically downloads the required safetensors files; later runs reuse the downloaded checkpoints and start more quickly.
If something goes wrong and you want to start over, run this from Windows CMD:
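To see what the first run downloaded, the sketch below lists cached model repositories and their size on disk. It assumes the scripts fetch weights through the Hugging Face libraries, whose hub cache defaults to ~/.cache/huggingface/hub and can be relocated with the HF_HUB_CACHE or HF_HOME environment variables. The helper names are hypothetical, and the lookup is simplified (it ignores XDG_CACHE_HOME, for example).

```python
import os
from pathlib import Path


def hf_cache_dir():
    """Resolve the Hugging Face hub cache directory from the usual env vars."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    default_home = Path.home() / ".cache" / "huggingface"
    return Path(os.environ.get("HF_HOME", str(default_home))) / "hub"


def cached_models(cache_dir=None):
    """Map each cached repo (e.g. 'org/name') to its size on disk in bytes."""
    cache_dir = Path(cache_dir) if cache_dir else hf_cache_dir()
    sizes = {}
    if not cache_dir.is_dir():
        return sizes
    for repo in cache_dir.glob("models--*"):
        # Folder names look like models--org--name.
        name = repo.name[len("models--"):].replace("--", "/")
        # Snapshot entries are symlinks into blobs/, so count real files only.
        sizes[name] = sum(
            f.stat().st_size
            for f in repo.rglob("*")
            if f.is_file() and not f.is_symlink()
        )
    return sizes


if __name__ == "__main__":
    for name, size in sorted(cached_models().items()):
        print(f"{size / 1024**3:6.2f} GiB  {name}")
```

Deleting a repo folder from the cache forces a fresh download on the next run, which can help if a download was interrupted.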
wsl --unregister AI-Sandbox
Then start back at the top.
- Voice input/output is fully offline
- Image generation is optional and can be disabled if VRAM is tight
- Audio playback supports:
  - Volume adjustment
  - Instant reversal toggle (normal ↔ reversed)
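The volume and reversal options above can be illustrated on raw samples. This is a minimal sketch that assumes mono 16-bit PCM audio held as a sequence of integers; adjust_audio is a hypothetical function name, and the repository may implement playback control differently (for example, in the browser).

```python
def adjust_audio(samples, volume=1.0, reverse=False):
    """Scale 16-bit PCM samples by `volume` and optionally reverse them.

    Values are clipped to the int16 range so loud settings distort
    instead of wrapping around.
    """
    scaled = [max(-32768, min(32767, round(s * volume))) for s in samples]
    return scaled[::-1] if reverse else scaled


if __name__ == "__main__":
    clip = [100, -200, 300]
    print(adjust_audio(clip, volume=0.5, reverse=True))  # [150, -100, 50]
```

Reversing twice restores the original order, which is why a toggle is enough. For interleaved stereo data, the samples would need to be reversed frame by frame rather than value by value so the channels do not swap.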
This project is licensed under the MIT License.
Important: This repository provides application code and UI logic only. Model weights and third-party tools (MiniCPM, Stable Diffusion, Whisper, Piper, etc.) are governed by their own respective licenses.
See the LICENSE file for details.
- OpenBMB — MiniCPM-V
- Hugging Face — Transformers & Diffusers
- OpenAI — Whisper (via faster-whisper)
- Gradio — UI framework
- Piper TTS — Offline speech synthesis
This project is intended for research, experimentation, and educational use. Always review the licenses of the included models before using them in production.
Note: This repository provides a wrapper and UI for third-party models and tools. Model weights and external dependencies are governed by their own respective licenses.