Alex0420W/A.I.R.A

A.I.R.A.

A voice-first personal assistant that feels like JARVIS — at home and on the go.

Python FastAPI OpenAI Realtime License: MIT

A.I.R.A. (AI Responsive Assistant) is a two-target voice assistant designed to live with you. Walk into your house and say "Hey AIRA, call mom" — the desktop hub picks up your voice and handles the call. Step outside and the same identity follows you onto a pocket-sized Raspberry Pi: hold a button, speak, get the same answers and actions. One brain, two bodies.

                         ┌────────────────────────────────────────┐
                         │  Backend (FastAPI · Python · async)    │
                         │  ┌────────────┐  ┌─────────────────┐   │
   ┌─────────────┐       │  │  Realtime  │  │ Agent Router    │   │
   │  Home       │ ◄───► │  │  Session   │◄─┤ + Approval Svc  │   │
   │  (wake-word)│  WS   │  │ (OpenAI)   │  └────────┬────────┘   │
   └─────────────┘       │  └────────────┘           │            │
                         │                           ▼            │
   ┌─────────────┐       │  ┌────────────┐  ┌─────────────────┐   │
   │  Pi (PTT)   │ ◄───► │  │  Memory    │  │ Tools           │   │
   │  Whisplay   │  WS   │  │  Service   │  │ (telephony,     │   │
   └─────────────┘       │  │ (PG+Redis) │  │  email, search, │   │
                         │  └────────────┘  │  calendar)      │   │
                         │                  └─────────────────┘   │
                         └────────────────────────────────────────┘

Why this exists

Most voice assistants are either toys (smart speakers that can't do much) or hostile (locked into one ecosystem, recording everything, surfacing ads). A.I.R.A. is built around three beliefs:

  1. Voice should be the primary interface, not a feature. Sub-second latency, natural turn-taking, and interruption handling — anything slower kills the loop.
  2. External actions need explicit consent. Placing a call or sending an email requires a verbal "yes" every time, given within a 30-second window; spending money is blocked outright in v1. No surprise actions.
  3. It should follow you. The home device and the portable Pi share one identity, one memory, one set of contacts. Continue a conversation across rooms or across town.

It's a personal project — not a product, not a startup. The goal is to ship a JARVIS that feels like JARVIS: fast, deferential, deeply integrated with the APIs I actually use.


How it works

A turn, end-to-end

  1. Activation — wake-word "hey aira" (home) or button press (Pi).
  2. Voice capture streams to the backend over a WebSocket as 24 kHz PCM frames.
  3. OpenAI Realtime API transcribes, reasons, and decides whether to call a tool.
  4. Agent Router validates the tool call, classifies its safety tier, and routes it.
  5. Approval Service intercepts Tier 2 actions (call, email) and asks for verbal confirmation.
  6. Tool Executor invokes Telnyx / Gmail / Maps / Calendar with retry, rate limiting, and timeouts.
  7. Memory Service persists the result, the context, and any relevant updates.
  8. TTS response streams back to the device and plays through the speaker.
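
Step 6 mentions retries and timeouts around each tool call. A minimal sketch of that idea follows, assuming an async tool callable; the name call_with_retry, the default values, and the choice of which errors count as retryable are illustrative assumptions, not the project's actual executor, and rate limiting is left out.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    tool: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,          # assumption: illustrative default
    timeout_s: float = 5.0,     # assumption: illustrative default
    base_delay_s: float = 0.2,  # assumption: illustrative default
) -> T:
    """Run `tool`, bounding each attempt with a timeout and backing off between failures."""
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            # Each attempt gets its own deadline so one slow call cannot stall the turn.
            return await asyncio.wait_for(tool(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay_s * 2**attempt)
    raise RuntimeError(f"tool failed after {attempts} attempts") from last_error
```

Only timeouts and connection errors are retried here; any other exception propagates on the first attempt, which keeps a bad request from being sent three times.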

Typical time-to-first-audio over home wifi: ~350-500ms.
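
For a sense of what the 24 kHz PCM frames in step 2 look like on the wire, here is a small arithmetic sketch. The sample rate comes from the text above; 16-bit mono samples and 20 ms frames are assumptions made for illustration, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000  # from the README
BYTES_PER_SAMPLE = 2     # assumption: 16-bit samples
CHANNELS = 1             # assumption: mono


def frame_bytes(frame_ms: int) -> int:
    """Size in bytes of one PCM frame covering `frame_ms` milliseconds."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def split_frames(pcm: bytes, frame_ms: int = 20) -> list[bytes]:
    """Cut a PCM buffer into fixed-size frames; the last frame may be shorter."""
    size = frame_bytes(frame_ms)
    return [pcm[i : i + size] for i in range(0, len(pcm), size)]
```

Under these assumptions a 20 ms frame is 960 bytes, so one second of speech is 50 frames and about 48 kB before any WebSocket overhead.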

Safety tiers

Every action is classified up front; the tier determines whether it can run automatically.

| Tier | Examples | Policy |
|------|----------|--------|
| 0 - Read | Look up a business, check the weather | Allow |
| 1 - Draft | Compose an email, generate a call script | Allow + log |
| 2 - Action | Place a call, send an email | Require verbal approval |
| 3 - High risk | Spend money, bulk operations | Blocked in v1 |

Approval prompts in ambient (wake-word) mode require a strict "yes" — no fuzzy matching — because the cost of a false positive in an always-on environment is higher than a missed intent.
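
The tier table and the strict-"yes" rule can be sketched as follows. The tool names, the policy strings, the fail-closed default for unknown tools, and the decision to tolerate trailing punctuation in the transcript are all assumptions for illustration; only the four tiers, their policies, and the 30-second window come from the text above.

```python
import asyncio
from enum import IntEnum
from typing import Awaitable, Callable


class Tier(IntEnum):
    READ = 0
    DRAFT = 1
    ACTION = 2
    HIGH_RISK = 3


# Hypothetical tool names mapped to the tiers from the table.
TOOL_TIERS = {
    "search_business": Tier.READ,
    "draft_email": Tier.DRAFT,
    "place_call": Tier.ACTION,
    "send_email": Tier.ACTION,
    "make_payment": Tier.HIGH_RISK,
}

POLICIES = {
    Tier.READ: "allow",
    Tier.DRAFT: "allow_and_log",
    Tier.ACTION: "require_verbal_approval",
    Tier.HIGH_RISK: "blocked",
}


def policy_for(tool_name: str) -> str:
    # Assumption: tools missing from the registry fail closed as high risk.
    return POLICIES[TOOL_TIERS.get(tool_name, Tier.HIGH_RISK)]


def is_strict_yes(transcript: str) -> bool:
    """Accept only the single word "yes"; no fuzzy matching ("yeah", "yes please")."""
    return transcript.strip().strip(".!").lower() == "yes"


async def wait_for_approval(
    next_transcript: Callable[[], Awaitable[str]], window_s: float = 30.0
) -> bool:
    """Approve only if a strict "yes" arrives inside the window; silence means no."""
    try:
        reply = await asyncio.wait_for(next_transcript(), timeout=window_s)
    except asyncio.TimeoutError:
        return False
    return is_strict_yes(reply)
```

Treating a timeout as a refusal matches the stated trade-off: a missed intent costs a repeated request, while a false positive places a call nobody asked for.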


The two devices

Home — ambient wake-word

A desktop or mini-PC (currently targeting an old laptop or NVIDIA DGX Spark) running the device client with a far-field USB mic array and a powered speaker. Wake on "hey aira", respond in conversation. The wake-word model is custom-trained via openWakeWord's TTS-augmented pipeline since "hey aira" isn't a stock keyword.

Pocket Pi — press-to-talk

A Raspberry Pi 5 with the PiSugar Whisplay HAT (integrated LCD, mic, speaker, button, battery). Tethered to a phone hotspot when out of the house. Press to talk, release to send — short presses (<500ms) are rejected as accidental, matching the proven gesture from PiSugar/whisplay-ai-chatbot.
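
The short-press rejection can be illustrated without any GPIO code. This is a sketch only: the class name PressToTalk and its methods are hypothetical, and the clock is injectable so the logic can be exercised off-device; the real handler reads the button through gpiod.

```python
import time
from typing import Callable, Optional

MIN_PRESS_S = 0.5  # presses shorter than 500 ms are treated as accidental


class PressToTalk:
    """Track one button and decide on release whether the press should be sent."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._pressed_at: Optional[float] = None

    def on_press(self) -> None:
        self._pressed_at = self._clock()

    def on_release(self) -> bool:
        """Return True if the recording should be sent, False if it is discarded."""
        if self._pressed_at is None:
            return False  # release without a matching press
        held = self._clock() - self._pressed_at
        self._pressed_at = None
        return held >= MIN_PRESS_S
```

Using a monotonic clock avoids false rejections if the system time jumps, for example after the Pi syncs NTP over the phone hotspot.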

The activation layer is a Protocol-based abstraction (activation.py), so any new hardware (mobile app, smart watch, kiosk) is a single class away.
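
As a rough picture of what "a single class away" could mean, here is a sketch using typing.Protocol. The interface shown (ActivationSource with one wait_for_activation method) is a guess for illustration and may not match what activation.py actually defines; KioskTouchActivation and run_once are likewise hypothetical.

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class ActivationSource(Protocol):
    """Hypothetical interface: block until the user activates, return the reason."""

    def wait_for_activation(self) -> str: ...


class KioskTouchActivation:
    """Example of new hardware: no inheritance needed, only the matching method."""

    def __init__(self, touch_events: Iterable[str]) -> None:
        self._events = iter(touch_events)

    def wait_for_activation(self) -> str:
        next(self._events)  # stand-in for waiting on a real touch event
        return "kiosk_touch"


def run_once(source: ActivationSource) -> str:
    """The session loop depends only on the protocol, not on any device class."""
    reason = source.wait_for_activation()
    return f"session started via {reason}"
```

Because Protocol uses structural typing, a new device class satisfies the interface by shape alone, so the device client does not need to import or subclass anything hardware-specific.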


Tech stack and decisions

| Layer | Choice | Why |
|-------|--------|-----|
| Voice model | OpenAI Realtime API | Best speech-to-speech latency + native function calling. No open-weight model matches it on the full bundle (latency + reasoning + tool calls + interruption handling) as of early 2026. |
| Wake word | openWakeWord | Open source, runs on CPU, supports custom training. |
| Telephony | Telnyx | ~60% cheaper than Twilio for the same call quality. |
| Email | Gmail API | User-owned accounts, OAuth, no SMTP relay. |
| Backend | FastAPI + asyncpg + Redis | Async all the way down. Predictable latency under load. |
| Database | PostgreSQL 16 (Alembic migrations) | Honest persistence. Real migration history, not stub schemas. |
| Cache | Redis 7 | Rate limit counters, session locks, token cache. |
| Workspace | uv + pyproject monorepo | Cross-package editable installs without setup.py rituals. |
| Logs | structlog (JSON) | Searchable, parseable, no ad-hoc print debugging. |

The cloud-vs-local question

I evaluated running a local voice-to-voice model on the DGX Spark — Moshi, Llama-Omni, NVIDIA Nemotron-Audio, etc. The honest finding: no open model matches GPT-4o Realtime on the full bundle of latency + reasoning + tool calls + interruption handling as of early 2026. The closest on latency (Moshi) lags badly on reasoning and lacks function calling, which the entire router/approval flow depends on.

The realtime-session package is built as "OpenAI + fallback" so the backend can be swapped without architectural rework. I'll revisit local models in 6-12 months, once the open-weight space catches up.
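
One possible reading of "OpenAI + fallback" is a shared backend interface plus a thin wrapper that switches when the primary is unreachable. The sketch below shows that shape only; VoiceBackend, FallbackSession, and the respond signature are hypothetical and are not taken from the realtime-session package.

```python
from typing import Protocol


class VoiceBackend(Protocol):
    """Hypothetical interface any voice model adapter would implement."""

    name: str

    async def respond(self, audio: bytes) -> bytes: ...


class FallbackSession:
    """Try the primary backend; on a connection failure, use the fallback."""

    def __init__(self, primary: VoiceBackend, fallback: VoiceBackend) -> None:
        self.primary = primary
        self.fallback = fallback

    async def respond(self, audio: bytes) -> tuple[str, bytes]:
        # Return the backend name alongside the audio so callers can log
        # which path actually served the turn.
        try:
            return self.primary.name, await self.primary.respond(audio)
        except ConnectionError:
            return self.fallback.name, await self.fallback.respond(audio)
```

With this shape, moving to a local model later means writing one more adapter that satisfies VoiceBackend; the router and approval flow never see the difference.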


Project structure

apps/
├── backend-api/          # FastAPI service, WebSocket session, auth, metrics
└── device-client/        # Edge client: audio I/O, activation, status display
    └── hardware/         # Pi GPIO button (gpiod), Whisplay LCD/LED (planned)

packages/
├── realtime-session/     # OpenAI Realtime adapter + state machine + cost guard
├── agent-router/         # Intent → tool → workflow with safety-tier classification
├── approval-service/     # Verbal approval flow, allow/blocklist, 30s timeout
├── memory-service/       # Users, contacts, preferences, conversation history
├── tools-core/           # Tool registry + executor (retry, rate limit, quotas)
├── tools-telephony/      # Telnyx adapter
├── tools-email/          # Gmail adapter
├── tools-search/         # Google Maps adapter
├── tools-calendar/       # Google Calendar adapter
└── shared/               # Types, audit logger, common utilities

infra/
├── docker/               # docker-compose for Postgres + Redis (dev)
└── terraform/            # Cloud provisioning (planned)

Quickstart

Prerequisites

  • Python 3.11+
  • uv package manager
  • Docker (for local Postgres + Redis), or your own Postgres 15+ / Redis 7+
  • An OpenAI API key with Realtime API access

Setup

git clone https://github.com/Alex0420W/A.I.R.A.git
cd A.I.R.A

# Bring up Postgres + Redis
docker compose -f infra/docker/docker-compose.yml up -d

# Install all packages in editable mode
uv sync --all-extras

# Configure secrets
cp .env.example .env
# Edit .env: at minimum set OPENAI_API_KEY

# Apply database migrations
uv run alembic upgrade head

# Start the backend
uv run python run-api.py

The API comes up at http://localhost:8000. Hit /health to confirm Postgres, Redis, and OpenAI credentials are wired correctly.

Run the device client

# In another terminal
uv run python -m aira_device_client

Default activation is wake-word. Override per device with environment variables — useful when you want the Pi to default to button mode and the home machine to default to wake-word:

# Pi
export AIRA_DEFAULT_ACTIVATION_MODE=button
export AIRA_DEVICE_ID=pi-pocket-01

# Home
export AIRA_DEFAULT_ACTIVATION_MODE=wake_word
export AIRA_WAKE_WORD=hey_aira
export AIRA_DEVICE_ID=home-hub-01
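
A sketch of how a client might read these variables at startup. The variable names and the wake-word default come from this section; the DeviceConfig dataclass, the load_device_config function, the "unknown-device" fallback ID, and the validation rule are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_MODES = {"wake_word", "button"}


@dataclass(frozen=True)
class DeviceConfig:
    activation_mode: str
    device_id: str
    wake_word: Optional[str]


def load_device_config(env: Mapping[str, str] = os.environ) -> DeviceConfig:
    """Build the device config from environment variables, failing fast on typos."""
    mode = env.get("AIRA_DEFAULT_ACTIVATION_MODE", "wake_word")
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported activation mode: {mode!r}")
    return DeviceConfig(
        activation_mode=mode,
        # Assumption: a placeholder ID when none is set.
        device_id=env.get("AIRA_DEVICE_ID", "unknown-device"),
        # The wake word only matters in wake-word mode.
        wake_word=env.get("AIRA_WAKE_WORD") if mode == "wake_word" else None,
    )
```

Passing the environment in as a mapping keeps the function easy to test with a plain dict instead of mutating os.environ.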

Pi-only setup

# On the Raspberry Pi only — adds gpiod for the Whisplay HAT button
uv sync --extra pi

Status

This is a personal-scale project under active development. The honest current state:

| Area | State |
|------|-------|
| Backend API | Wired end-to-end. Health, metrics, WebSocket session, auth middleware, encryption. |
| Realtime session | OpenAI Realtime adapter, state machine, cost guard, audio buffering. |
| Agent router | Tool registry, safety-tier classification, parameter validation. |
| Approval service | Verbal approval flow, 30s timeout, allow/blocklist. |
| Memory service | Postgres schema, contacts, preferences, conversation history (Alembic migrations). |
| Tools | Telephony, email, search, calendar: all implemented against real APIs (not yet live-tested end-to-end). |
| Device client | Audio I/O, WebSocket reconnection, status indicators, press-to-talk + wake-word activation. |
| Pi GPIO | GpioButtonHandler (gpiod, BCM 17, <500ms reject). Pending hardware to test. |
| Whisplay LCD | Not yet ported. StatusManager is abstracted to plug in. |
| Wake word | openWakeWord integrated; "hey aira" custom model not yet trained. |
| Tests | 56 passing. Structural coverage; integration tests are the next priority. |

The code is ~95% scaffolded; the testing is ~30% there. The shipping path is: train the wake word → live-test the OpenAI path on home hardware → wire and verify backend audio_commit / audio_cancel routing → run an end-to-end voice test → port the Whisplay LCD when the Pi hardware arrives.


Roadmap

v1 — Magic loop

  • Custom-trained hey_aira wake-word model
  • Live end-to-end test: home wifi, real OpenAI Realtime, real Telnyx call placed by voice
  • Pi build with Whisplay HAT (button + LCD + RGB feedback)
  • Backend audio_commit / audio_cancel routing
  • Integration tests covering the full voice → tool → response loop

v2 — Personalization

  • Contact disambiguation that actually learns ("call mom" → which mom?)
  • Multi-user voice ID
  • Per-device preferences (Pi defaults vs. home defaults)
  • Conversation memory pruning that respects what's worth remembering

v3 — Connectors

  • Calendar event creation from voice
  • Note-taking and reminder workflows
  • Voice-driven home automation (Home Assistant bridge)

Development

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy .

# Pre-commit hooks
uv run pre-commit install


License

MIT — see LICENSE.

About

Voice-first personal assistant — wake-word at home, press-to-talk on a Raspberry Pi, one identity across both. Built on FastAPI + OpenAI Realtime API.
