A conversational AI system with voice + text input, powered by Ollama (local LLM), Whisper speech-to-text, and XTTS v2 text-to-speech.
- 🐍 Python 3.8+
- 🎵 FFmpeg (for audio processing)
- 🦙 Ollama (for local LLM serving)
- 🎤 A working microphone (for voice input)
- Clone the repo
git clone <repository-url>
cd Convo-Ai
- Create a virtual environment
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
# Windows
python -m venv venv
.\venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Install FFmpeg
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows (Chocolatey)
choco install ffmpeg
- Install & start Ollama
# Download: https://ollama.ai/download
ollama serve
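To confirm the LLM backend is working before starting the app, you can hit Ollama's local REST API directly. This is just a smoke test, not part of the project; the model name `llama3` is an assumption — use any model you have pulled with `ollama pull`.

```python
# Quick smoke test for the local Ollama server (assumes the default port 11434).
# The model name is an example; use one you have pulled, e.g. `ollama pull llama3`.
import json
import urllib.request

OLLAMA = "http://localhost:11434"

# List the models Ollama has installed.
with urllib.request.urlopen(f"{OLLAMA}/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]
print("Installed models:", models)

# Ask for a short, non-streaming completion.
payload = json.dumps({
    "model": "llama3",
    "prompt": "Say hello in one sentence.",
    "stream": False,
}).encode()
req = urllib.request.Request(f"{OLLAMA}/api/generate", data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```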
- Start the server
source venv/bin/activate # or .\venv\Scripts\activate on Windows
python server.py
- Run the client
source venv/bin/activate
python talk.py
- Choose input mode
  - 🎤 Voice → Press `r` to start/stop recording
  - ⌨ Text → Type and press Enter
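If you'd rather script the text mode than drive `talk.py` interactively, a minimal client could look like the sketch below. The WebSocket URL and the plain-text message format are assumptions (check `config.json` and `server.py` for the real values), and it assumes the `websockets` package is available.

```python
# Minimal text-mode client sketch. The endpoint URL and message format are
# assumptions; adjust them to match config.json and server.py.
import asyncio
import websockets

WS_URL = "ws://localhost:8000/ws"  # assumed default; see config.json

async def chat() -> None:
    async with websockets.connect(WS_URL) as ws:
        while True:
            text = input("you> ").strip()
            if not text:          # empty line exits the loop
                break
            await ws.send(text)   # send the user's message
            reply = await ws.recv()
            print("ai >", reply)

if __name__ == "__main__":
    asyncio.run(chat())
```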
- 🎤 Voice and text input modes
- 🤖 Natural language processing with Ollama
- 🔊 Text-to-speech output (XTTS v2)
- 📝 Conversation history & session logging
- 🎭 Mood analysis
- 🌐 Optional web interface (FastAPI + WebSocket)
- 🔒 Privacy-first: all processing runs locally
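The conversation history and session-logging feature listed above writes to the `logs/` directory. Purely as an illustration (the project's actual file names and fields may differ), a JSON-Lines layout for per-session logs could look like this:

```python
# Illustrative only: append conversation turns as JSON Lines under logs/.
# The real file names and fields used by Convo-Ai may differ.
import json
import time
from pathlib import Path
from typing import Optional

LOG_DIR = Path("logs")
LOG_DIR.mkdir(exist_ok=True)

def log_turn(session_id: str, role: str, text: str, mood: Optional[str] = None) -> None:
    """Append one conversation turn to the session's .jsonl log file."""
    entry = {"ts": time.time(), "role": role, "text": text, "mood": mood}
    with open(LOG_DIR / f"{session_id}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

log_turn("demo-session", "user", "Hello there!")
log_turn("demo-session", "assistant", "Hi! How can I help?", mood="friendly")
```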
Edit `config.json` to adjust:
- Voice model
- Speed & pitch
- WebSocket URL
- LLM settings
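As a sketch of how those settings might be consumed, the loader below reads `config.json` and falls back to defaults. Every key name here is an assumption — open the `config.json` shipped with the repo for the real schema.

```python
# Sketch of reading config.json. All key names below are assumptions; the real
# schema is defined by the config.json in the repository.
import json
from pathlib import Path

DEFAULTS = {
    "voice_model": "xtts_v2",
    "speech_speed": 1.0,
    "speech_pitch": 1.0,
    "websocket_url": "ws://localhost:8000/ws",
    "llm": {"model": "llama3", "temperature": 0.7},
}

def load_config(path: str = "config.json") -> dict:
    """Merge user settings from config.json over the defaults above."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text(encoding="utf-8")))
    return cfg

print(load_config())
```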
- Backend → Python, FastAPI
- Speech-to-Text → OpenAI Whisper
- LLM → Ollama (local)
- TTS → XTTS v2
- Frontend → HTML + JavaScript
- Realtime → WebSocket
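To show how the pieces of this stack fit together, here is a heavily stripped-down sketch (not the project's `server.py`): a FastAPI WebSocket endpoint that forwards text to the local Ollama API and returns the reply. The real server additionally transcribes incoming audio with Whisper and synthesizes speech with XTTS v2; the endpoint path and model name are assumptions.

```python
# Stripped-down illustration of the stack — not the project's server.py.
# Text only: the real server also runs Whisper on incoming audio and XTTS v2 on replies.
import requests
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_llm(prompt: str) -> str:
    """Send one prompt to the local Ollama server and return its reply."""
    r = requests.post(OLLAMA_URL, json={"model": "llama3", "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

@app.websocket("/ws")
async def chat(websocket: WebSocket) -> None:
    await websocket.accept()
    try:
        while True:
            user_text = await websocket.receive_text()  # Whisper output would arrive here
            reply = ask_llm(user_text)                  # synchronous call, fine for a sketch
            await websocket.send_text(reply)            # the real server also returns TTS audio
    except WebSocketDisconnect:
        pass
```

A sketch like this can be run with `uvicorn <module>:app`, assuming `uvicorn` is installed.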
convo-ai-isolated/
├── src/
│   ├── server.py
│   └── talk.py
├── templates/
│   └── index.html
├── static/
├── logs/
├── tts_cache/
├── config.json
├── requirements.txt
├── README.md
└── HOWTO.md
- ✅ Virtual environment activated?
- ✅ Dependencies installed?
- ✅ FFmpeg installed?
- ✅ Ollama running?
- ✅ Microphone permissions granted?
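A small script can check most of that list automatically (microphone permissions are OS-level and aren't covered here):

```python
# Quick environment check for the items above: Python version, virtualenv,
# FFmpeg on PATH, and whether the Ollama server answers on its default port.
import shutil
import sys
import urllib.request

print("Python:", sys.version.split()[0], "(need 3.8+)")
print("Virtualenv active:", sys.prefix != sys.base_prefix)
print("FFmpeg on PATH:", shutil.which("ffmpeg") is not None)

try:
    urllib.request.urlopen("http://localhost:11434/api/tags", timeout=2)
    print("Ollama reachable: True")
except OSError:
    print("Ollama reachable: False (is `ollama serve` running?)")
```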
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit (`git commit -m 'Add some AmazingFeature'`)
- Push (`git push origin feature/AmazingFeature`)
- Open a Pull Request