A voice-first personal assistant that feels like JARVIS — at home and on the go.
A.I.R.A. (AI Responsive Assistant) is a two-target voice assistant designed to live with you. Walk into your house and say "Hey AIRA, call mom" — the desktop hub picks up your voice and handles the call. Step outside and the same identity follows you onto a pocket-sized Raspberry Pi: hold a button, speak, get the same answers and actions. One brain, two bodies.
                              ┌────────────────────────────────────────┐
                              │ Backend (FastAPI · Python · async)     │
                              │ ┌────────────┐  ┌─────────────────┐    │
    ┌─────────────┐       │ │ Realtime   │  │ Agent Router    │    │
    │ Home        │ ◄───► │ │ Session    │◄─┤ + Approval Svc  │    │
    │ (wake-word) │  WS   │ │ (OpenAI)   │  └────────┬────────┘    │
    └─────────────┘       │ └────────────┘           │             │
                              │                          ▼             │
    ┌─────────────┐       │ ┌────────────┐  ┌─────────────────┐    │
    │ Pi (PTT)    │ ◄───► │ │ Memory     │  │ Tools           │    │
    │ Whisplay    │  WS   │ │ Service    │  │ (telephony,     │    │
    └─────────────┘       │ │ (PG+Redis) │  │ email, search,  │    │
                              │ └────────────┘  │ calendar)       │    │
                              │                 └─────────────────┘    │
                              └────────────────────────────────────────┘
Most voice assistants are either toys (smart speakers that can't do much) or hostile (locked into one ecosystem, recording everything, surfacing ads). A.I.R.A. is built around three beliefs:
- Voice should be the primary interface, not a feature. Sub-second latency, natural turn-taking, and interruption handling — anything slower kills the loop.
- External actions need explicit consent. Placing a call, sending an email, or spending money requires a verbal "yes" — every time, with a 30-second window. No surprise actions.
- It should follow you. The home device and the portable Pi share one identity, one memory, one set of contacts. Continue a conversation across rooms or across town.
It's a personal project — not a product, not a startup. The goal is to ship a JARVIS that feels like JARVIS: fast, deferential, deeply integrated with the APIs I actually use.
- Activation: wake-word "hey aira" (home) or button press (Pi).
- Voice capture streams to the backend over a WebSocket as 24 kHz PCM frames.
- OpenAI Realtime API transcribes, reasons, and decides whether to call a tool.
- Agent Router validates the tool call, classifies its safety tier, and routes it.
- Approval Service intercepts Tier 2 actions (call, email) and asks for verbal confirmation.
- Tool Executor invokes Telnyx / Gmail / Maps / Calendar with retry, rate limiting, and timeouts.
- Memory Service persists the result, the context, and any relevant updates.
- TTS response streams back to the device and plays through the speaker.
Typical time-to-first-audio over home wifi: ~350-500ms.
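
To make the capture step concrete, here is a minimal sketch of how raw audio could be cut into fixed-size frames before being sent over the WebSocket. Only the 24 kHz rate comes from the description above; the 16-bit mono format, the 20 ms frame length, and the names `FRAME_BYTES` and `iter_frames` are assumptions for illustration, not the device client's actual code.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # stated in the pipeline description
BYTES_PER_SAMPLE = 2     # assumption: 16-bit signed PCM
CHANNELS = 1             # assumption: mono capture
FRAME_MS = 20            # assumption: one 20 ms frame per WebSocket message

# 480 samples per frame * 2 bytes * 1 channel = 960 bytes
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS


def iter_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Yield fixed-size frames, zero-padding the final partial frame with silence."""
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            frame += b"\x00" * (frame_bytes - len(frame))
        yield frame
```

Each yielded frame would then be sent as one binary WebSocket message. Smaller frames reduce buffering delay at the cost of more messages per second.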
Every action is classified up front; the tier determines whether it can run automatically.
| Tier | Examples | Policy |
|---|---|---|
| 0 — Read | Look up a business, check the weather | Allow |
| 1 — Draft | Compose an email, generate a call script | Allow + log |
| 2 — Action | Place a call, send an email | Require verbal approval |
| 3 — High risk | Spend money, bulk operations | Blocked in v1 |
Approval prompts in ambient (wake-word) mode require a strict "yes" — no fuzzy matching — because the cost of a false positive in an always-on environment is higher than a missed intent.
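
The tier table and the strict-"yes" rule can be sketched roughly as follows. This is an illustration under assumptions, not the agent-router or approval-service API: `Tier`, `POLICY`, `is_strict_yes`, and `may_execute` are hypothetical names, and a `None` reply stands in for the approval window expiring without an answer.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    READ = 0
    DRAFT = 1
    ACTION = 2
    HIGH_RISK = 3


# Illustrative labels mirroring the "Policy" column of the table.
POLICY = {
    Tier.READ: "allow",
    Tier.DRAFT: "allow_and_log",
    Tier.ACTION: "require_verbal_approval",
    Tier.HIGH_RISK: "blocked",
}


def is_strict_yes(transcript: str) -> bool:
    """Accept only the bare word "yes" (case, edge whitespace, and a trailing . or ! ignored)."""
    return transcript.strip().rstrip(".!").lower() == "yes"


def may_execute(tier: Tier, reply: Optional[str] = None) -> bool:
    """Decide whether a tool call may run, given its tier and the user's spoken reply."""
    policy = POLICY[tier]
    if policy in ("allow", "allow_and_log"):
        return True
    if policy == "require_verbal_approval":
        # No fuzzy matching: "yeah" or "yes please" do not count as approval.
        return reply is not None and is_strict_yes(reply)
    return False  # blocked tiers never run, whatever the reply
```

With this shape, `may_execute(Tier.ACTION, "yeah")` is rejected while `may_execute(Tier.ACTION, "Yes.")` passes, which matches the preference for a missed intent over a false positive.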
A desktop or mini-PC (currently targeting an old laptop or NVIDIA DGX Spark) running the
device client with a far-field USB mic array and a powered speaker. Wake on "hey aira",
respond in conversation. The wake-word model is custom-trained via openWakeWord's
TTS-augmented pipeline since "hey aira" isn't a stock keyword.
A Raspberry Pi 5 with the PiSugar Whisplay HAT (integrated LCD, mic, speaker, button, battery). Tethered to a phone hotspot when out of the house. Press to talk, release to send — short presses (<500ms) are rejected as accidental, matching the proven gesture from PiSugar/whisplay-ai-chatbot.
The activation layer is a Protocol-based abstraction (activation.py), so any new hardware (mobile app, smart watch, kiosk) is a single class away.
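
As a rough illustration of what "a single class away" can look like with `typing.Protocol`, the sketch below defines a hypothetical `Activation` interface and two implementations. The real interface in activation.py may differ; the class names, the `accepts` method, and the wake-word threshold are assumptions. Only the 500 ms short-press cutoff comes from the text above.

```python
from typing import Protocol

MIN_PRESS_SECONDS = 0.5  # presses shorter than this are rejected as accidental


class Activation(Protocol):
    """Hypothetical structural interface: any class with these members qualifies."""

    name: str

    def accepts(self, signal: float) -> bool: ...


class ButtonActivation:
    name = "button"

    def __init__(self, min_press_s: float = MIN_PRESS_SECONDS) -> None:
        self.min_press_s = min_press_s

    def accepts(self, signal: float) -> bool:
        # Here the signal is the press duration in seconds.
        return signal >= self.min_press_s


class WakeWordActivation:
    name = "wake_word"

    def __init__(self, threshold: float = 0.5) -> None:  # threshold is an assumed value
        self.threshold = threshold

    def accepts(self, signal: float) -> bool:
        # Here the signal is the detector's confidence score in [0, 1].
        return signal >= self.threshold


def should_start_capture(source: Activation, signal: float) -> bool:
    """Works with any object that structurally matches Activation."""
    return source.accepts(signal)
```

Neither implementation inherits from `Activation`. A new device only needs to provide the same members, which is what keeps the abstraction cheap to extend.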
| Layer | Choice | Why |
|---|---|---|
| Voice model | OpenAI Realtime API | Best speech-to-speech latency + native function calling. No open-weight model matches it on the full bundle (latency + reasoning + tool calls + interruption handling) as of early 2026. |
| Wake word | openWakeWord | Open source, runs on CPU, supports custom training. |
| Telephony | Telnyx | ~60% cheaper than Twilio for the same call quality. |
| Email | Gmail API | User-owned accounts, OAuth, no SMTP relay. |
| Backend | FastAPI + asyncpg + Redis | Async all the way down. Predictable latency under load. |
| Database | PostgreSQL 16 (Alembic migrations) | Honest persistence. Migration history not stub schemas. |
| Cache | Redis 7 | Rate limit counters, session locks, token cache. |
| Workspace | uv + pyproject monorepo | Cross-package editable installs without setup.py rituals. |
| Logs | structlog (JSON) | Searchable, parseable, no ad-hoc print debugging. |
I evaluated running a local voice-to-voice model on the DGX Spark — Moshi, Llama-Omni, NVIDIA Nemotron-Audio, etc. The honest finding: no open model matches GPT-4o Realtime on the full bundle of latency + reasoning + tool calls + interruption handling as of early 2026. The closest on latency (Moshi) lags badly on reasoning and lacks function calling, which the entire router/approval flow depends on.
The realtime-session package is built as "OpenAI + fallback" so the backend can be swapped without architectural rework. I plan to revisit local inference in 6-12 months, once the open-weight space catches up.
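
One way to read "OpenAI + fallback" is a priority-ordered list of interchangeable backends. The sketch below shows that idea only; `RealtimeBackend`, `StubBackend`, and `pick_backend` are hypothetical names, and the realtime-session package may structure its fallback differently.

```python
from typing import Protocol, Sequence


class RealtimeBackend(Protocol):
    """Hypothetical interface that a hosted or local voice backend would satisfy."""

    name: str

    def is_available(self) -> bool: ...


class StubBackend:
    """Stand-in used here so the selection logic can run without any network access."""

    def __init__(self, name: str, available: bool) -> None:
        self.name = name
        self._available = available

    def is_available(self) -> bool:
        return self._available


def pick_backend(candidates: Sequence[RealtimeBackend]) -> RealtimeBackend:
    """Return the first available backend, in priority order."""
    for backend in candidates:
        if backend.is_available():
            return backend
    raise RuntimeError("no realtime backend available")
```

Because callers depend only on the interface, adding a local model later would mean appending one more candidate to the list rather than reworking the session code.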
apps/
├── backend-api/ # FastAPI service, WebSocket session, auth, metrics
└── device-client/ # Edge client: audio I/O, activation, status display
    └── hardware/ # Pi GPIO button (gpiod), Whisplay LCD/LED (planned)
packages/
├── realtime-session/ # OpenAI Realtime adapter + state machine + cost guard
├── agent-router/ # Intent → tool → workflow with safety-tier classification
├── approval-service/ # Verbal approval flow, allow/blocklist, 30s timeout
├── memory-service/ # Users, contacts, preferences, conversation history
├── tools-core/ # Tool registry + executor (retry, rate limit, quotas)
├── tools-telephony/ # Telnyx adapter
├── tools-email/ # Gmail adapter
├── tools-search/ # Google Maps adapter
├── tools-calendar/ # Google Calendar adapter
└── shared/ # Types, audit logger, common utilities
infra/
├── docker/ # docker-compose for Postgres + Redis (dev)
└── terraform/ # Cloud provisioning (planned)
- Python 3.11+
- uv package manager
- Docker (for local Postgres + Redis), or your own Postgres 15+ / Redis 7+
- An OpenAI API key with Realtime API access
git clone https://github.com/Alex0420W/A.I.R.A.git
cd A.I.R.A
# Bring up Postgres + Redis
docker compose -f infra/docker/docker-compose.yml up -d
# Install all packages in editable mode
uv sync --all-extras
# Configure secrets
cp .env.example .env
# Edit .env: at minimum set OPENAI_API_KEY
# Apply database migrations
uv run alembic upgrade head
# Start the backend
uv run python run-api.py

The API comes up at http://localhost:8000. Hit /health to confirm Postgres, Redis, and OpenAI credentials are wired correctly.
# In another terminal
uv run python -m aira_device_client

Default activation is wake-word. Override per device with environment variables; this is useful when you want the Pi to default to button mode and the home machine to default to wake-word:
# Pi
export AIRA_DEFAULT_ACTIVATION_MODE=button
export AIRA_DEVICE_ID=pi-pocket-01
# Home
export AIRA_DEFAULT_ACTIVATION_MODE=wake_word
export AIRA_WAKE_WORD=hey_aira
export AIRA_DEVICE_ID=home-hub-01

# On the Raspberry Pi only (adds gpiod for the Whisplay HAT button)
uv sync --extra pi

This is a personal-scale project under active development. The honest current state:
| Area | State |
|---|---|
| Backend API | Wired end-to-end. Health, metrics, WebSocket session, auth middleware, encryption. |
| Realtime session | OpenAI Realtime adapter, state machine, cost guard, audio buffering. |
| Agent router | Tool registry, safety-tier classification, parameter validation. |
| Approval service | Verbal approval flow, 30s timeout, allow/blocklist. |
| Memory service | Postgres schema, contacts, preferences, conversation history (Alembic migrations). |
| Tools | Telephony, email, search, calendar — all implemented against real APIs (not yet live-tested end-to-end). |
| Device client | Audio I/O, WebSocket reconnection, status indicators, press-to-talk + wake-word activation. |
| Pi GPIO | GpioButtonHandler (gpiod, BCM 17, <500ms reject). Pending hardware to test. |
| Whisplay LCD | Not yet ported. StatusManager is abstracted to plug in. |
| Wake word | openWakeWord integrated; "hey aira" custom model not yet trained. |
| Tests | 56 passing. Structural coverage; integration tests are the next priority. |
The code is ~95% scaffolded; testing is ~30% there. The shipping path is: train wake word → live-test the OpenAI path on home hardware → wire and verify backend audio_commit / audio_cancel routing → end-to-end voice test → port Whisplay LCD when Pi hardware arrives.
v1 — Magic loop
- Custom-trained hey_aira wake-word model
- Live end-to-end test: home wifi, real OpenAI Realtime, real Telnyx call placed by voice
- Pi build with Whisplay HAT (button + LCD + RGB feedback)
- Backend audio_commit / audio_cancel routing
- Integration tests covering the full voice → tool → response loop
v2 — Personalization
- Contact disambiguation that actually learns ("call mom" → which mom?)
- Multi-user voice ID
- Per-device preferences (Pi defaults vs. home defaults)
- Conversation memory pruning that respects what's worth remembering
v3 — Connectors
- Calendar event creation from voice
- Note-taking and reminder workflows
- Voice-driven home automation (Home Assistant bridge)
# Run tests
uv run pytest
# Lint
uv run ruff check .
# Format
uv run ruff format .
# Type check
uv run mypy .
# Pre-commit hooks
uv run pre-commit install

- OpenAI for the Realtime API
- Telnyx for telephony
- openWakeWord for open-source wake-word detection
- PiSugar/whisplay-ai-chatbot — the reference design the Pi build is modeled on
MIT — see LICENSE.