From a simple ESPHome full-duplex doorbell to a PBX-like multi-device intercom, all the way to a complete Voice Assistant experience, with wake word detection, echo cancellation, LVGL touchscreen UI, intercom, and ready-to-flash configs for tested ESP32 hardware.
*UI states (screenshots): Idle · Calling · Ringing · In Call*
- Overview
- Features
- Architecture
- Installation
- Operating Modes
- Configuration Reference
- Entities and Controls
- Call Flow Diagrams
- Hardware Support
- i2s_audio_duplex
- Voice Assistant + Intercom Experience
- Troubleshooting
- License
Intercom API is a scalable full-duplex ESPHome intercom framework that grows with your needs:
| Use Case | Configuration | Description |
|---|---|---|
| 🔔 Simple Doorbell | 1 ESP + Browser | Ring notification, answer from phone/PC |
| 🏠 Home Intercom | Multiple ESPs | Call between rooms (Kitchen ↔ Bedroom) |
| 📞 PBX-like System | ESPs + Browser + HA | Full intercom network with Home Assistant as a participant |
| 🤖 Voice Assistant + Intercom | ESP (display optional) | Wake word, voice commands, weather, intercom, all on one device |
Home Assistant acts as the central hub - it can receive calls (doorbell), make calls to ESPs, and relay calls between devices. All audio flows through HA, enabling remote access without complex NAT/firewall configuration.
```mermaid
graph TD
    HA[🏠 Home Assistant<br/>PBX hub]
    ESP1[📻 ESP #1<br/>Kitchen]
    ESP2[📻 ESP #2<br/>Bedroom]
    Browser[🌐 Browser<br/>Phone]
    HA <--> ESP1
    HA <--> ESP2
    HA <--> Browser
```
This component was born from the limitations of esphome-intercom, which uses direct ESP-to-ESP UDP communication. That approach works great for local networks but fails in these scenarios:
- Remote access: WebRTC/go2rtc fails through NAT without port forwarding
- Complex setup: Requires go2rtc server, STUN/TURN configuration
- Browser limitations: WebRTC permission and codec issues
Intercom API solves these problems:
- Uses ESPHome's native API for control (port 6053)
- Opens a dedicated TCP socket for audio streaming (port 6054)
- Works remotely - Audio streams through HA's WebSocket, so Nabu Casa/reverse proxy/VPN all work
- No WebRTC, no go2rtc, no port forwarding required
- Full-duplex audio - Talk and listen simultaneously
- Two operating modes:
- Simple: Browser ↔ Home Assistant ↔ ESP
- Full: ESP ↔ Home Assistant ↔ ESP (intercom between devices)
- Echo Cancellation (AEC) - Built-in acoustic echo cancellation using ESP-SR (ES8311 digital feedback mode provides perfect sample-accurate echo cancellation)
- Voice Assistant compatible - Coexists with ESPHome Voice Assistant and Micro Wake Word
- Auto Answer - Configurable automatic call acceptance
- Ringtone on incoming calls - Devices play a looping ringtone sound while ringing
- Volume Control - Adjustable speaker volume and microphone gain
- Contact Management - Select call destination from discovered devices
- Status LED - Visual feedback for call states
- Persistent Settings - Volume, gain, AEC state saved to flash
- Remote Access - Works through any HA remote access method
```mermaid
graph TB
    subgraph HA[🏠 HOME ASSISTANT]
        subgraph Integration[intercom_native integration]
            WS[WebSocket API<br/>/start /stop /audio]
            TCP[TCP Client<br/>Port 6054<br/>Async queue]
            Bridge[Auto-Bridge<br/>Full Mode<br/>ESP↔ESP relay]
        end
    end
    subgraph Browser[🌐 Browser]
        Card[Lovelace Card<br/>AudioWorklet<br/>getUserMedia]
    end
    subgraph ESP[📻 ESP32]
        API[intercom_api<br/>FreeRTOS Tasks<br/>I2S mic/spk]
    end
    Card <-->|WebSocket<br/>JSON+Base64| WS
    API <-->|TCP :6054<br/>Binary PCM| TCP
```
| Parameter | Value |
|---|---|
| Sample Rate | 16000 Hz |
| Bit Depth | 16-bit signed PCM |
| Channels | Mono |
| ESP Chunk Size | 512 bytes (256 samples = 16ms) |
| Browser Chunk Size | 2048 bytes (1024 samples = 64ms) |
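The chunk durations follow directly from the parameters above (16-bit mono PCM at 16 kHz, so 2 bytes per sample). A quick sketch of the arithmetic:

```python
SAMPLE_RATE = 16000      # Hz, mono
BYTES_PER_SAMPLE = 2     # 16-bit signed PCM

def chunk_duration_ms(chunk_bytes: int) -> float:
    """Duration of one PCM chunk in milliseconds."""
    samples = chunk_bytes // BYTES_PER_SAMPLE
    return samples * 1000 / SAMPLE_RATE

print(chunk_duration_ms(512))   # ESP chunk → 16.0 ms
print(chunk_duration_ms(2048))  # Browser chunk → 64.0 ms
```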
Header (4 bytes):
| Byte 0 | Byte 1 | Bytes 2-3 |
|---|---|---|
| Type | Flags | Length (LE) |
Message Types:
| Code | Name | Description |
|---|---|---|
| 0x01 | AUDIO | PCM audio data |
| 0x02 | START | Start streaming (includes caller_name, no_ring flag) |
| 0x03 | STOP | Stop streaming |
| 0x04 | PING | Keep-alive |
| 0x05 | PONG | Keep-alive response |
| 0x06 | ERROR | Error notification |
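The 4-byte header maps naturally onto a pack/unpack pair. A minimal sketch of the framing described above (constant names are illustrative, not taken from the firmware):

```python
import struct

# Message type codes from the table above (names are illustrative)
MSG_AUDIO, MSG_START, MSG_STOP = 0x01, 0x02, 0x03
MSG_PING, MSG_PONG, MSG_ERROR = 0x04, 0x05, 0x06

def pack_header(msg_type: int, flags: int, payload_len: int) -> bytes:
    """4-byte header: type (B), flags (B), little-endian length (H)."""
    return struct.pack("<BBH", msg_type, flags, payload_len)

def unpack_header(header: bytes):
    """Inverse of pack_header: returns (type, flags, length)."""
    return struct.unpack("<BBH", header)

hdr = pack_header(MSG_AUDIO, 0, 512)  # header for one 512-byte PCM chunk
```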
- In HACS, go to ⋮ → Custom repositories
- Add `https://github.com/n-IA-hane/intercom-api` as Integration
- Find "Intercom Native" and click Download
- Restart Home Assistant
- Go to Settings → Integrations → Add Integration → search "Intercom Native" → click Submit
The integration automatically registers the Lovelace card; no manual frontend setup is needed.
```bash
# From the repository root
cp -r custom_components/intercom_native /config/custom_components/
```

Then either:
- Add via UI: Settings → Integrations → Add Integration → Intercom Native
- Or add `intercom_native:` to `configuration.yaml`
Restart Home Assistant.
The integration will:
- Register WebSocket API commands for the card
- Create `sensor.intercom_active_devices` (lists all intercom ESPs)
- Auto-detect ESP state changes for Full Mode bridging
- Auto-register the Lovelace card as a frontend resource
Add the external component to your ESPHome device configuration:
```yaml
external_components:
  - source: github://n-IA-hane/intercom-api
    components: [intercom_api, esp_aec]

esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf
    sdkconfig_options:
      # Default is 10, increased for: TCP server + API + OTA
      CONFIG_LWIP_MAX_SOCKETS: "16"

# I2S Audio (example with separate mic/speaker)
i2s_audio:
  - id: i2s_mic_bus
    i2s_lrclk_pin: GPIO3
    i2s_bclk_pin: GPIO2
  - id: i2s_spk_bus
    i2s_lrclk_pin: GPIO6
    i2s_bclk_pin: GPIO7

microphone:
  - platform: i2s_audio
    id: mic_component
    i2s_audio_id: i2s_mic_bus
    i2s_din_pin: GPIO4
    adc_type: external
    pdm: false
    bits_per_sample: 32bit
    sample_rate: 16000

speaker:
  - platform: i2s_audio
    id: spk_component
    i2s_audio_id: i2s_spk_bus
    i2s_dout_pin: GPIO8
    dac_type: external
    sample_rate: 16000
    bits_per_sample: 16bit

# Echo Cancellation (recommended)
esp_aec:
  id: aec_processor
  sample_rate: 16000
  filter_length: 4     # 64ms tail length
  mode: voip_low_cost  # Optimized for real-time

# Intercom API - Simple mode (browser only)
intercom_api:
  id: intercom
  mode: simple
  microphone: mic_component
  speaker: spk_component
  aec_id: aec_processor
```

Full mode adds ESP↔ESP calls and FSM event callbacks:

```yaml
intercom_api:
  id: intercom
  mode: full             # Enable ESP↔ESP calls
  microphone: mic_component
  speaker: spk_component
  aec_id: aec_processor
  ringing_timeout: 30s   # Auto-decline unanswered calls

  # FSM event callbacks
  on_ringing:
    - light.turn_on:
        id: status_led
        effect: "Ringing"
  on_outgoing_call:
    - light.turn_on:
        id: status_led
        effect: "Calling"
  on_streaming:
    - light.turn_on:
        id: status_led
        red: 0%
        green: 100%
        blue: 0%
  on_idle:
    - light.turn_off: status_led

# Switches (with restore from flash)
switch:
  - platform: intercom_api
    intercom_api_id: intercom
    auto_answer:
      name: "Auto Answer"
      restore_mode: RESTORE_DEFAULT_OFF
    aec:
      name: "Echo Cancellation"
      restore_mode: RESTORE_DEFAULT_ON

# Volume controls
number:
  - platform: intercom_api
    intercom_api_id: intercom
    speaker_volume:
      name: "Speaker Volume"
    mic_gain:
      name: "Mic Gain"

# Buttons for manual control
button:
  - platform: template
    name: "Call"
    on_press:
      - intercom_api.call_toggle:
          id: intercom
  - platform: template
    name: "Next Contact"
    on_press:
      - intercom_api.next_contact:
          id: intercom

# Subscribe to HA's contact list (Full mode)
text_sensor:
  - platform: homeassistant
    id: ha_active_devices
    entity_id: sensor.intercom_active_devices
    on_value:
      - intercom_api.set_contacts:
          id: intercom
          contacts_csv: !lambda 'return x;'
```

```yaml
# Example: call a specific room from HA automation
# or use in YAML lambda with intercom_api.set_contact
button:
  - platform: template
    name: "Call Kitchen"
    on_press:
      - intercom_api.set_contact:
          id: intercom
          contact: "Kitchen Intercom"
      - intercom_api.start:
          id: intercom
```

Each GPIO button can call a different room — like a condominium intercom panel:
```yaml
binary_sensor:
  # Button 1: Call Kitchen
  - platform: gpio
    pin:
      number: GPIO4
      mode: INPUT_PULLUP
      inverted: true
    on_press:
      - intercom_api.set_contact:
          id: intercom
          contact: "Kitchen Intercom"
      - intercom_api.start:
          id: intercom
  # Button 2: Call Living Room
  - platform: gpio
    pin:
      number: GPIO5
      mode: INPUT_PULLUP
      inverted: true
    on_press:
      - intercom_api.set_contact:
          id: intercom
          contact: "Living Room Intercom"
      - intercom_api.start:
          id: intercom
```
⚠️ Name matching is exact (case-sensitive). The `contact` value must match the device name exactly as it appears in the contacts list. There is no fuzzy matching or validation — a typo will silently fail and fire `on_call_failed`.

Contact names come from the `name:` substitution in each device's YAML. Home Assistant converts the ESPHome name to a display name: `name: kitchen-intercom` → HA device name `Kitchen Intercom` (hyphens become spaces, words capitalized).

How to verify the correct name: check the `sensor.{name}_destination` entity in HA — cycle through contacts and note the exact string shown for each device.
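The name-conversion rule described above can be sketched as a small helper (an approximation of Home Assistant's behavior, not its exact implementation):

```python
def esphome_to_display_name(node_name: str) -> str:
    """Approximate HA's rule: hyphens/underscores become spaces,
    each word is capitalized. Illustrative only."""
    words = node_name.replace("_", "-").split("-")
    return " ".join(w.capitalize() for w in words)

esphome_to_display_name("kitchen-intercom")  # → "Kitchen Intercom"
```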
The Lovelace card is automatically registered when the integration loads; no manual file copying or resource registration is needed.
The card is available in the Lovelace card picker - just search for "Intercom":
Then configure it with the visual editor:
Alternatively, you can add it manually via YAML:
```yaml
type: custom:intercom-card
entity_id: <your_esp_device_id>
name: Kitchen Intercom
mode: full  # or 'simple'
```

The card automatically discovers ESPHome devices with the intercom_api component.
The Lovelace card provides full-duplex bidirectional audio with the ESP device: you can talk and listen simultaneously through your browser or the Home Assistant Companion app. The card captures audio from your microphone via getUserMedia() and plays incoming audio from the ESP in real-time.
Important: HTTPS required. Browser microphone access (`getUserMedia`) requires a secure context. You need HTTPS to use the card's audio features. Solutions: Nabu Casa, Let's Encrypt, reverse proxy with SSL, or self-signed certificate. Exception: `localhost` works without HTTPS.
Note: Devices must be added to Home Assistant via the ESPHome integration before they appear in the card.
In Simple mode, the browser communicates directly with a single ESP device through Home Assistant. If the ESP has Auto Answer enabled, streaming starts automatically when you call.
```mermaid
graph LR
    Browser[🌐 Browser] <-->|WebSocket| HA[🏠 HA]
    HA <-->|TCP 6054| ESP[📻 ESP]
```
Call Flow (Browser → ESP):
- User clicks "Call" in browser
- Card sends `intercom_native/start` to HA
- HA opens TCP connection to ESP:6054
- HA sends START message (caller="Home Assistant")
- ESP enters Ringing state (or auto-answers)
- Bidirectional audio streaming begins
Call Flow (ESP → Browser):
- User presses "Call" on ESP (with destination set to "Home Assistant")
- ESP sends RING message to HA
- HA notifies all connected browser cards
- Card shows incoming call with Answer/Decline buttons
- User clicks "Answer" in browser
- Bidirectional audio streaming begins
Use Simple mode when:
- You want a simple doorbell with full-duplex audio
- You need browser-to-ESP and ESP-to-browser communication
- You want minimal configuration
Full mode includes everything from Simple mode (Browser ↔ ESP calls) plus enables a PBX-like system where ESP devices can also call each other through Home Assistant, which acts as an audio relay.
```mermaid
graph TB
    ESP1[📻 ESP #1<br/>Kitchen] <-->|TCP 6054| HA[🏠 HA<br/>PBX hub]
    ESP2[📻 ESP #2<br/>Bedroom] <-->|TCP 6054| HA
    Browser[🌐 Browser/App] <-->|WebSocket| HA
```
Call Flow (ESP #1 calls ESP #2):
- User selects "Bedroom" on ESP #1 display/button
- User presses Call button → ESP #1 enters "Outgoing" state
- HA detects state change via ESPHome API
- HA sends START to ESP #2 (caller="Kitchen")
- ESP #2 enters "Ringing" state
- User answers on ESP #2 (or auto-answer)
- HA bridges audio: ESP #1 ↔ HA ↔ ESP #2
- Either device can hangup → STOP propagates to both
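Conceptually, the bridging in steps 7–8 is two one-way relays running concurrently on the HA side. A minimal queue-based asyncio sketch (illustrative only, not the integration's actual code):

```python
import asyncio

CHUNK = 512  # one 16 ms PCM chunk

async def relay(src: asyncio.Queue, dst: asyncio.Queue):
    """Copy audio chunks one way; None signals hangup (STOP)."""
    while True:
        chunk = await src.get()
        await dst.put(chunk)      # forward audio (or the STOP marker)
        if chunk is None:
            break                 # hangup propagated, relay exits

async def bridge(esp1_rx, esp1_tx, esp2_rx, esp2_tx):
    """Full-duplex bridge: ESP1 mic → ESP2 speaker and vice versa."""
    await asyncio.gather(relay(esp1_rx, esp2_tx), relay(esp2_rx, esp1_tx))
```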
Full mode features:
- Contact list auto-discovery from HA
- Next/Previous contact navigation
- Caller ID display
- Ringing timeout with auto-decline
- Bidirectional hangup propagation
When an ESP device has "Home Assistant" selected as destination and initiates a call (via GPIO button press or template button), it fires an `esphome.intercom_call` event for notifications, and the Lovelace card goes into ringing state with Answer/Decline buttons.
| Option | Type | Default | Description |
|---|---|---|---|
| `id` | ID | Required | Component ID |
| `mode` | string | `simple` | `simple` (browser only) or `full` (ESP↔ESP) |
| `microphone` | ID | Required | Reference to microphone component |
| `speaker` | ID | Required | Reference to speaker component |
| `aec_id` | ID | - | Reference to esp_aec component |
| `dc_offset_removal` | bool | false | Remove DC offset (for mics like SPH0645) |
| `ringing_timeout` | time | 0s | Auto-decline after timeout (0 = disabled) |
| Callback | Trigger | Use Case |
|---|---|---|
| `on_ringing` | Incoming call (auto_answer OFF) | Turn on ringing LED/sound, show display page |
| `on_outgoing_call` | User initiated call | Show "Calling..." status |
| `on_answered` | Call was answered (local or remote) | Log event |
| `on_streaming` | Audio streaming active | Solid LED, enable amp |
| `on_idle` | State returns to idle | Turn off LED, disable amp |
| `on_hangup` | Call ended normally | Log with reason string |
| `on_call_failed` | Call failed (unreachable, busy, etc.) | Show error with reason string |
| Action | Description |
|---|---|
| `intercom_api.start` | Start outgoing call |
| `intercom_api.stop` | Hangup current call |
| `intercom_api.answer_call` | Answer incoming call |
| `intercom_api.decline_call` | Decline incoming call |
| `intercom_api.call_toggle` | Smart: idle→call, ringing→answer, streaming→hangup |
| `intercom_api.next_contact` | Select next contact (Full mode) |
| `intercom_api.prev_contact` | Select previous contact (Full mode) |
| `intercom_api.set_contacts` | Update contact list from CSV |
| `intercom_api.set_contact` | Select a specific contact by name |
| `intercom_api.set_volume` | Set speaker volume (float, 0.0–1.0) |
| `intercom_api.set_mic_gain_db` | Set microphone gain (float, -20.0 to +20.0 dB) |
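As a sketch of what a dB gain value means in practice (the usual audio convention; how the firmware applies it internally is an assumption here), the conversion to a linear sample multiplier is:

```python
def db_to_linear(gain_db: float) -> float:
    """Standard audio convention: gain in dB → linear multiplier.
    0 dB = unity, +20 dB = 10x, -20 dB = 0.1x."""
    return 10.0 ** (gain_db / 20.0)

db_to_linear(0.0)    # → 1.0
db_to_linear(20.0)   # → 10.0
db_to_linear(-20.0)  # → 0.1
```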
| Condition | Returns true when |
|---|---|
| `intercom_api.is_idle` | State is Idle |
| `intercom_api.is_ringing` | State is Ringing (incoming) |
| `intercom_api.is_calling` | State is Outgoing (waiting answer) |
| `intercom_api.is_in_call` | State is Streaming (active call) |
| `intercom_api.is_streaming` | Audio is actively streaming |
| `intercom_api.is_answering` | Call is being answered |
| `intercom_api.is_incoming` | Has incoming call |
| Option | Type | Default | Description |
|---|---|---|---|
| `id` | ID | Required | Component ID |
| `sample_rate` | int | 16000 | Must match audio sample rate |
| `filter_length` | int | 4 | Echo tail in frames (4 = 64ms) |
| `mode` | string | `voip_low_cost` | AEC algorithm mode |
AEC modes (ESP-SR closed-source Espressif library):
| Mode | CPU | Memory | Use Case |
|---|---|---|---|
| `voip_low_cost` | Low | Low | Recommended, sufficient for all setups including VA + MWW |
| `voip_high_perf` | Medium | Medium | Better filter quality, try if not using display/heavy workloads |
| `sr_low_cost` | Medium | Medium | Speech recognition optimized, alternative to voip modes |
| `sr_high_perf` | High | Very High | Best cancellation but may exhaust DMA memory on ESP32-S3 |
Note: All modes have similar CPU cost per frame (~7ms). The difference is primarily in memory allocation and adaptive filter quality.
| Entity | Type | Description |
|---|---|---|
| `sensor.{name}_intercom_state` | Text Sensor | Current state: Idle, Ringing, Streaming, etc. |
| Entity | Type | Description |
|---|---|---|
| `sensor.{name}_destination` | Text Sensor | Currently selected contact |
| `sensor.{name}_caller` | Text Sensor | Who is calling (during incoming call) |
| `sensor.{name}_contacts` | Text Sensor | Contact count |
| Platform | Entities |
|---|---|
| `switch` | auto_answer, aec |
| `number` | speaker_volume (0-100%), mic_gain (-20 to +20 dB) |
| `button` | Call, Next Contact, Prev Contact, Decline (template) |
```mermaid
sequenceDiagram
    participant B as 🌐 Browser
    participant HA as 🏠 Home Assistant
    participant E as 📻 ESP
    B->>HA: WS: start {host: "esp.local"}
    HA->>E: TCP Connect :6054
    HA->>E: START {caller:"HA"}
    Note right of E: State: Ringing<br/>(or auto-answer)
    E-->>HA: PONG (answered)
    Note right of E: State: Streaming
    loop Bidirectional Audio
        B->>HA: WS: audio (base64)
        HA->>E: TCP: AUDIO (PCM) → Speaker
        E->>HA: TCP: AUDIO (PCM) ← Mic
        HA->>B: WS: audio_event
    end
    B->>HA: WS: stop
    HA->>E: TCP: STOP
    Note right of E: State: Idle
```
```mermaid
sequenceDiagram
    participant E1 as 📻 ESP #1 (Caller)
    participant HA as 🏠 Home Assistant
    participant E2 as 📻 ESP #2 (Callee)
    Note left of E1: State: Outgoing<br/>(user pressed Call)
    E1->>HA: ESPHome API state change
    HA->>E2: TCP Connect :6054
    HA->>E2: START {caller:"ESP1"}
    Note right of E2: State: Ringing
    HA->>E1: TCP Connect :6054
    HA->>E1: START {caller:"ESP2"}
    Note left of E1: State: Ringing
    E2-->>HA: PONG (user answered)
    Note right of E2: State: Streaming
    HA-->>E1: PONG
    Note left of E1: State: Streaming
    loop Bridge relays audio
        E1->>HA: AUDIO (mic)
        HA->>E2: AUDIO → Speaker
        E2->>HA: AUDIO (mic)
        HA->>E1: AUDIO → Speaker
    end
    E1->>HA: STOP (hangup)
    HA->>E2: STOP
    Note left of E1: State: Idle
    Note right of E2: State: Idle
```
| Device | Microphone | Speaker | I2S Mode | Component | AEC Reference | VA/MWW |
|---|---|---|---|---|---|---|
| ESP32-S3 Mini | SPH0645 | MAX98357A | Dual bus | `i2s_audio` | Ring buffer | Yes (mixer speaker) |
| Xiaozhi Ball V3 | ES8311 | ES8311 | Single bus | `i2s_audio_duplex` | ES8311 digital feedback (stereo) | Yes (dual mic path) |
| Waveshare ESP32-S3-AUDIO | ES7210 (4-ch) | ES8311 | Single bus TDM | `i2s_audio_duplex` | ES7210 TDM analog (MIC3) | Yes (dual mic path) |
| Waveshare ESP32-P4-WiFi6-Touch-LCD-10.1 | ES7210 (4-ch) | ES8311 | Single bus TDM | `i2s_audio_duplex` | ES7210 TDM analog (MIC3) | Yes (dual mic path, LVGL touch display) |
Want to help expand this list? Send me a device to test or consider a donation, every bit helps!
- ESP32-S3 or ESP32-P4 with PSRAM (required for AEC)
- I2S microphone (INMP441, SPH0645, ES8311, etc.)
- I2S speaker amplifier (MAX98357A, ES8311, etc.)
- ESP-IDF framework (not Arduino)
This repo also provides i2s_audio_duplex, a full-duplex I2S component for single-bus audio codecs (ES8311, ES8388, WM8960) and multi-codec TDM setups (ES8311 + ES7210). Standard ESPHome i2s_audio cannot drive mic and speaker on the same I2S bus simultaneously; i2s_audio_duplex solves this with:
- True full-duplex on a single I2S bus
- Built-in AEC integration: stereo digital feedback, TDM hardware reference, or ring buffer
- Dual mic paths: raw (pre-AEC) for wake word + AEC-processed for voice assistant
- FIR decimation: run the bus at 48kHz (codec native) while processing at 16kHz
- Reference counting: multiple consumers share the same mic safely
The I2S bus runs at a higher rate for better DAC/ADC quality, with internal FIR decimation to produce 16kHz for processing:
| Parameter | Value |
|---|---|
| I2S Bus Rate | Configurable (sample_rate, e.g. 48000 Hz) |
| Output Rate | Configurable (output_sample_rate, e.g. 16000 Hz) |
| Decimation | FIR filter, ratio = bus/output (e.g. ×3 for 48→16kHz) |
| FIR Filter | 31-tap, Kaiser beta=8.0, ~60dB stopband, linear phase |
| Speaker Input | Bus rate (48kHz), ESPHome resampler upsamples before play |
| Mic Output | Output rate (16kHz), for MWW, Voice Assistant, Intercom |
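To make the decimation stage concrete, here is a pure-Python sketch of a 31-tap Kaiser-windowed lowpass followed by ×3 decimation. It mirrors the parameters in the table (31 taps, beta=8.0, 48→16 kHz) but is a readability sketch, not the firmware's optimized FIR implementation:

```python
import math

BUS_RATE, OUT_RATE = 48000, 16000
RATIO = BUS_RATE // OUT_RATE   # ×3 decimation
TAPS = 31

def _i0(x):
    """Modified Bessel function I0 via series expansion (for Kaiser window)."""
    s, t = 1.0, 1.0
    for k in range(1, 25):
        t *= (x / (2 * k)) ** 2
        s += t
    return s

def _kaiser(n, taps, beta):
    r = 2 * n / (taps - 1) - 1
    return _i0(beta * math.sqrt(1 - r * r)) / _i0(beta)

def lowpass_taps(cutoff=0.5 / RATIO, taps=TAPS, beta=8.0):
    """Windowed-sinc lowpass; cutoff is a fraction of the bus rate."""
    m = (taps - 1) / 2
    h = []
    for n in range(taps):
        x = n - m
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        h.append(sinc * _kaiser(n, taps, beta))
    g = sum(h)
    return [c / g for c in h]  # normalize to unity DC gain

def decimate(samples, h):
    """FIR-filter, then keep every RATIO-th output sample."""
    out = []
    for i in range(0, len(samples) - len(h) + 1, RATIO):
        out.append(sum(h[k] * samples[i + k] for k in range(len(h))))
    return out
```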
MWW, Voice Assistant STT, and Intercom operate at 16kHz internally. The I2S bus runs at 48kHz (the codec's native rate), so:
- TTS via `announcement_pipeline` with `sample_rate: 48000` arrives at 48kHz from HA. Full 48kHz quality to the DAC.
- Streaming radio / Music Assistant audio arrives at the sample rate declared by the media player - 48kHz when configured as such.
- Media files (timer sounds, notifications) at native 48kHz are played directly without resampling.
- Intercom audio is sent/received at 16kHz over TCP and upsampled to 48kHz for local playback via the resampler speaker.
Many integrated codecs use a single I2S bus for both mic and speaker. Standard ESPHome i2s_audio cannot handle this simultaneously. Use i2s_audio_duplex:
```yaml
external_components:
  - source: github://n-IA-hane/intercom-api
    components: [intercom_api, i2s_audio_duplex, esp_aec]

i2s_audio_duplex:
  id: i2s_duplex
  i2s_lrclk_pin: GPIO45
  i2s_bclk_pin: GPIO9
  i2s_mclk_pin: GPIO16
  i2s_din_pin: GPIO10
  i2s_dout_pin: GPIO8
  sample_rate: 48000         # I2S bus rate (codec native)
  output_sample_rate: 16000  # Mic/AEC/MWW/VA rate (FIR decimation ×3)

microphone:
  - platform: i2s_audio_duplex
    id: mic_component
    i2s_audio_duplex_id: i2s_duplex

speaker:
  - platform: i2s_audio_duplex
    id: spk_component
    i2s_audio_duplex_id: i2s_duplex
```

If your codec supports it (ES8311, and potentially others with DAC loopback), stereo digital feedback is the optimal AEC reference method. This is the single most impactful configuration choice.
How it works:
- ES8311 outputs a stereo I2S frame: L channel = DAC loopback (what the speaker is playing), R channel = ADC (microphone)
- The reference signal is sample-accurate: same I2S frame as the mic capture, no timing estimation needed
- `aec_reference_delay_ms: 10` (just a few ms for internal codec latency, vs ~80ms for ring buffer mode)
```yaml
i2s_audio_duplex:
  aec_id: aec_component
  use_stereo_aec_reference: true  # Enable DAC feedback
  aec_reference_delay_ms: 10      # Sample-aligned, minimal delay

esphome:
  on_boot:
    - lambda: |-
        // Configure ES8311 register 0x44: output DAC+ADC on stereo ASDOUT
        uint8_t data[2] = {0x44, 0x48};
        id(i2c_bus).write(0x18, data, 2);
```

Without stereo feedback, the component falls back to a ring buffer reference: it copies speaker audio to a delay buffer and reads it back ~80ms later to match the acoustic path. This works with any codec but requires careful delay tuning and is never perfectly aligned.
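The ring-buffer fallback is essentially a fixed-delay line on the speaker signal. A minimal sketch (class name and API are illustrative, not the component's actual code):

```python
from collections import deque

class RingBufferReference:
    """Delays speaker samples so the AEC reference roughly matches
    the acoustic path (e.g. ~80 ms at 16 kHz = 1280 samples)."""
    def __init__(self, delay_ms=80, sample_rate=16000):
        n = delay_ms * sample_rate // 1000
        self.buf = deque([0] * n, maxlen=n)  # pre-filled with silence

    def push_speaker(self, sample):
        """Store what is being played now; return the sample played
        delay_ms ago, to be fed to the AEC as its reference."""
        ref = self.buf[0]        # oldest sample (about to be evicted)
        self.buf.append(sample)  # newest speaker sample
        return ref
```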
For boards with a multi-channel ADC (ES7210), the AEC reference can be captured as a hardware analog signal: the ES8311 DAC output is wired to an ES7210 input (MIC3), providing a sample-aligned reference from the same TDM I2S frame:
```yaml
i2s_audio_duplex:
  id: i2s_duplex
  i2s_lrclk_pin: GPIO14
  i2s_bclk_pin: GPIO13
  i2s_mclk_pin: GPIO12
  i2s_din_pin: GPIO15
  i2s_dout_pin: GPIO16
  sample_rate: 48000
  output_sample_rate: 16000
  aec_id: aec_processor
  use_tdm_reference: true
  tdm_total_slots: 4
  tdm_mic_slots: [0, 2]  # ADC1(MIC1), ADC2(MIC2)
  tdm_ref_slot: 1        # ADC3(MIC3) = ES8311 DAC feedback
```

Note: ES7210 requires an `on_boot` lambda (priority 200) to enable TDM mode and set MIC3 gain to 0dB. See `waveshare-s3-audio-va-intercom.yaml` for the complete working config.
i2s_audio_duplex provides two microphone outputs, raw (pre-AEC) and AEC-processed, enabling wake word detection during TTS playback:
```yaml
microphone:
  - platform: i2s_audio_duplex
    id: mic_aec            # AEC-processed: for VA STT + intercom TX
    i2s_audio_duplex_id: i2s_duplex
  - platform: i2s_audio_duplex
    id: mic_raw            # Raw: for MWW (pre-AEC, hears through TTS)
    i2s_audio_duplex_id: i2s_duplex
    pre_aec: true

micro_wake_word:
  microphone: mic_raw      # Raw mic for best wake word detection

voice_assistant:
  microphone: mic_aec      # AEC mic for clean STT
```

See the i2s_audio_duplex README for full details.
*Screenshots: ESP32-P4 Weather + Voice Assistant · ESP32-P4 Intercom + Voice Assistant · Xiaozhi Ball VA + Intercom*
The Voice Assistant and Intercom coexist seamlessly on the same hardware: shared microphone, shared speaker (via audio mixer), shared wake word detection. No display required (works on headless devices like the Waveshare S3 Audio); on devices with a screen, you also get a full touch UI:
- Always listening: Micro Wake Word runs continuously on raw (pre-AEC) audio, detecting the wake word even while TTS is playing or during an intercom call
- Touch or voice: Start the assistant by saying the wake word or tapping the screen (on touch displays)
- Barge-in: Say the wake word during a TTS response to interrupt and ask a new question
- Intercom calls: Call other devices or Home Assistant with one tap; incoming calls ring with audio + visual feedback
- Weather at a glance: Current conditions, temperature, and 5-day forecast updated automatically (touch displays)
- Mood-aware responses: The assistant shows different expressions (happy, neutral, angry) based on the tone of its reply. Requires instructing your LLM to prepend an ASCII emoticon (`:-)`, `:-(`, `:-|`) to each response based on its tone
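On the receiving side, the mood prefix could be parsed with a helper like this (entirely illustrative; the emoticon-to-mood mapping and function name are assumptions, not taken from the repo):

```python
# Assumed mapping: :-( → angry, matching the happy/neutral/angry set above
MOODS = {":-)": "happy", ":-|": "neutral", ":-(": "angry"}

def parse_mood(reply: str):
    """Split a leading ASCII emoticon off an LLM reply.
    Returns (mood, text); defaults to neutral if no prefix found."""
    for emoticon, mood in MOODS.items():
        if reply.startswith(emoticon):
            return mood, reply[len(emoticon):].lstrip()
    return "neutral", reply

parse_mood(":-) The weather is lovely today.")
```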
AEC uses Espressif's closed-source ESP-SR library. All modes have similar CPU cost per frame (~7ms out of 16ms budget). The difference is primarily in memory allocation and adaptive filter quality.
Recommended: voip_low_cost for devices with integrated codecs (ES8311, ES8388). This is more than sufficient for echo cancellation in voice calls and intercom, while keeping CPU free for Voice Assistant, MWW, and display rendering.
```yaml
esp_aec:
  sample_rate: 16000
  filter_length: 4     # 64ms tail, sufficient for integrated codecs
  mode: voip_low_cost  # Light on resources, good echo cancellation
```

If you are not using a display or other heavy workloads, and want to experiment with better cancellation quality, you can try `voip_high_perf` with `filter_length: 8`. But `voip_low_cost` is the safe default.
Avoid sr_high_perf: It allocates very large DMA buffers that can exhaust memory on ESP32-S3, causing SPI errors and instability.
AEC processing is automatically gated: it only runs when the speaker had real audio within the last 250ms. When the speaker is silent (idle, no TTS, no intercom audio), AEC is bypassed and mic audio passes through unchanged.
This prevents the adaptive filter from drifting during silence, which would otherwise suppress the mic signal and kill wake word detection. The gating is transparent, no configuration needed.
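The gating decision is a simple timestamp check. A tiny sketch of the logic described above (names and structure are illustrative, not the firmware's code):

```python
SILENCE_TIMEOUT_MS = 250  # bypass AEC after this much speaker silence

class AecGate:
    """Run AEC only while the speaker recently played real audio."""
    def __init__(self):
        # Start "expired" so AEC is bypassed until the speaker plays
        self.last_audio_ms = -SILENCE_TIMEOUT_MS

    def on_speaker_frame(self, now_ms, is_silent):
        """Call per speaker frame; remembers when real audio last played."""
        if not is_silent:
            self.last_audio_ms = now_ms

    def aec_active(self, now_ms):
        """True → run AEC; False → pass mic audio through unchanged."""
        return (now_ms - self.last_audio_ms) < SILENCE_TIMEOUT_MS
```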
Two custom Micro Wake Word models trained by the author are included in the wakewords/ directory:
- Hey Bender (`hey_bender.json`): inspired by the Futurama character
- Hey Trowyayoh (`hey_trowyayoh.json`): phonetic spelling of the Italian word "troiaio" (roughly: "what a mess", or more colorfully, "bullshit")
These are standard .json + .tflite files compatible with ESPHome's micro_wake_word. To use them:
```yaml
micro_wake_word:
  models:
    - model: "wakewords/hey_trowyayoh.json"
```

Running a display alongside Voice Assistant, Micro Wake Word, AEC, and intercom on a single ESP32-S3 is challenging due to RAM and CPU constraints. The xiaozhi-ball-v3.yaml and waveshare-p4-touch-lcd-va-intercom.yaml configs demonstrate proven approaches using LVGL (Light and Versatile Graphics Library):
| Before (ili9xxx manual) | After (LVGL) |
|---|---|
| 14 C++ page lambdas | Declarative YAML widgets |
| 26 `component.update` calls | Automatic dirty-region refresh |
| `animate_display` script (40 lines) | `animimg` widget (built-in) |
| `text_pagination_timer` script | `long_mode: SCROLL_CIRCULAR` |
| Precomputed geometry (chord widths, x/y metrics) | LVGL layout engine |
| Manual ping-pong frame logic | Duplicated frame list in `animimg` `src:` |
Key benefits: lower CPU (dirty-region only), no component.update contention, native animation (animimg), mood-based backgrounds via lv_img_set_src(), and automatic text scrolling (SCROLL_CIRCULAR).
Timer overlays use `top_layer` with `LV_OBJ_FLAG_HIDDEN`, visible on any page. Media files are auto-resampled by the `platform: resampler` speaker in the mixer pipeline.
Every setup is different: room acoustics, mic sensitivity, speaker placement, codec characteristics. We encourage you to:
- Try different `filter_length` values (4 vs 8); longer isn't always better if your acoustic path is short
- Toggle AEC on/off during calls to hear the difference; the `aec` switch is available in HA
- Adjust `mic_gain`: higher gain helps voice detection but can introduce noise
- Test MWW during TTS with your specific wake word; some words are more robust than others
- Compare `voip_low_cost` vs `voip_high_perf`: the difference may be subtle in your environment
- Monitor ESP logs: AEC diagnostics, task timing, and heap usage are all logged at DEBUG level
- Verify `intercom_native:` is in `configuration.yaml`
- Restart Home Assistant after adding the integration
- Ensure ESP device is connected via ESPHome integration
- Check ESP has `intercom_api` component configured
- Clear browser cache and reload
- Check speaker wiring and I2S pin configuration
- Verify `speaker_enable` GPIO if your amp has an enable pin
- Check volume level (default 80%)
- Look for I2S errors in ESP logs
- Check browser microphone permissions
- Verify HTTPS (required for getUserMedia)
- Check browser console for AudioContext errors
- Try a different browser (Chrome recommended)
- Enable AEC: create `esp_aec` component and link with `aec_id`
- Ensure AEC switch is ON in Home Assistant
- Reduce speaker volume
- Increase physical distance between mic and speaker
- Check WiFi signal strength (should be > -70 dBm)
- Verify Home Assistant is not overloaded
- Check for network congestion
- Reduce ESP log level to `WARN`
- Check TCP port 6054 is accessible
- Verify no firewall blocking HA→ESP connection
- Check Home Assistant logs for connection errors
- Try restarting the ESP device
- Ensure all ESPs use `mode: full`
- Verify `sensor.intercom_active_devices` exists in HA
- Check ESP subscribes to this sensor via `text_sensor: platform: homeassistant`
- Devices must be online and connected to HA
When an ESP device calls "Home Assistant", it fires an esphome.intercom_call event. Use this automation to receive push notifications:
```yaml
alias: Doorbell Notification
description: Send push notification when doorbell rings - tap to open intercom
triggers:
  - trigger: event
    event_type: esphome.intercom_call
conditions: []
actions:
  - action: notify.mobile_app_your_phone
    data:
      title: "🔔 Incoming Call"
      message: "📞 {{ trigger.event.data.caller }} is calling..."
      data:
        clickAction: /lovelace/intercom
        channel: doorbell
        importance: high
        ttl: 0
        priority: high
        actions:
          - action: URI
            title: "📱 Open"
            uri: /lovelace/intercom
          - action: ANSWER
            title: "✅ Answer"
  - action: persistent_notification.create
    data:
      title: "🔔 Incoming Call"
      message: "📞 {{ trigger.event.data.caller }} is calling..."
      notification_id: intercom_call
mode: single
```

Event data available:
- `trigger.event.data.caller` - Device name (e.g., "Intercom Xiaozhi")
- `trigger.event.data.destination` - Always "Home Assistant"
- `trigger.event.data.type` - "doorbell"
Note: Replace `notify.mobile_app_your_phone` with your mobile app service and `/lovelace/intercom` with your dashboard URL.
💡 The possibilities are endless! This event can trigger any Home Assistant automation. Some ideas: flash smart lights to get attention, play a chime on media players, announce "Someone is at the door" via TTS on your smart speakers, auto-unlock for trusted callers, trigger a camera snapshot, or notify all family members simultaneously.
```yaml
title: Intercom
views:
  - title: Intercom
    icon: mdi:phone-voip
    cards: []
    type: sections
    max_columns: 2
    sections:
      - type: grid
        cards:
          - type: custom:intercom-card
            entity_id: <your_device_id>
            name: Intercom Mini
            mode: full
          - type: entities
            entities:
              - entity: number.intercom_mini_speaker_volume
                name: Volume
              - entity: number.intercom_mini_mic_gain
                name: Mic gain
              - entity: switch.intercom_mini_echo_cancellation
              - entity: switch.intercom_mini_auto_answer
              - entity: sensor.intercom_mini_contacts
              - entity: button.intercom_mini_refresh_contacts
      - type: grid
        cards:
          - type: custom:intercom-card
            entity_id: <your_device_id>
            name: Intercom Xiaozhi
            mode: full
          - type: entities
            entities:
              - entity: number.intercom_xiaozhi_speaker_volume
                name: Volume
              - entity: number.intercom_xiaozhi_mic_gain
                name: Mic gain
              - entity: switch.intercom_xiaozhi_echo_cancellation
              - entity: switch.intercom_xiaozhi_auto_answer
              - entity: sensor.intercom_xiaozhi_contacts
              - entity: button.intercom_xiaozhi_refresh_contacts
```

Working configs tested on real hardware are included in the repository:
| File | Device | Features |
|---|---|---|
| `xiaozhi-ball-v3.yaml` | Xiaozhi Ball V3 (ES8311) | VA + MWW + Intercom + LVGL display + 48kHz audio |
| `xiaozhi-ball-v3-intercom.yaml` | Xiaozhi Ball V3 (ES8311) | Intercom only, C++ display |
| `waveshare-s3-audio-va-intercom.yaml` | Waveshare ESP32-S3-AUDIO (ES8311 + ES7210) | VA + MWW + Intercom + TDM AEC + LED feedback |
| `waveshare-p4-touch-lcd-va-intercom.yaml` | Waveshare ESP32-P4-WiFi6-Touch-LCD-10.1 (ES8311 + ES7210) | VA + MWW + Intercom + LVGL 10.1" touch split-screen (weather + intercom tileview, touch-to-talk VA with mood images, 5-day forecast) + ringtone |
| `esp32-s3-mini-va-intercom.yaml` | ESP32-S3 Mini (SPH0645 + MAX98357A) | VA + MWW + Intercom, LED feedback |
| `esp32-s3-mini-intercom.yaml` | ESP32-S3 Mini (SPH0645 + MAX98357A) | Intercom only, LED feedback |
- Non-admin user fix: Replaced event bus audio delivery (`hass.bus.async_fire("intercom_audio")`) with a custom WS subscription command (`intercom_native/subscribe_audio`). Non-admin HA users can now use the intercom card. Card v2.1.3.
- `i2s_audio_duplex` deep audit: Major refactor of the real-time audio task for correctness, maintainability, and performance. The monolithic `audio_task_()` (800+ lines) is now split into an `AudioTaskCtx` struct (groups all buffers, sizes, invariants, per-frame snapshots) and three focused processing functions: `process_rx_path_()`, `process_aec_and_callbacks_()`, `process_tx_path_()`. Cross-thread `float` variables (`mic_gain_`, `mic_attenuation_`, `speaker_volume_`, `aec_ref_volume_`) were converted to `std::atomic<float>` (fixes technically undefined behavior). A snapshot pattern loads all atomics once per 16 ms frame into local `ctx` fields, eliminating hundreds of redundant `.load()` calls in sample loops. AEC buffers use `heap_caps_aligned_alloc(16, ...)` for ESP-SR SIMD safety. New YAML options `task_priority`, `task_core`, and `task_stack_size` allow per-device tuning (with single-core SoC validation). `duplex_microphone` pre-allocates `audio_buffer_` to avoid real-time heap allocation. Callback typedefs are documented with their real-time constraints.
- Code audit fixes: Shared `scale_sample()` extracted to `esp_aec/audio_utils.h`. Stack VLAs replaced with heap buffers in `intercom_api`. S3 Audio LED `transition_length: 0ms` (RMT blocking fix).
- DC offset aligned to upstream: musicdsp.org DC-block filter now runs in Q31 space. In 16-bit space, `>>10` truncates to 0, making the filter unstable — samples must be shifted `<<16` to Q31 first.
- MWW reliability: All devices switched to `mic_raw` (`pre_aec: true`). Added `alexa` as a second wake word model. Template Wake Word switch on all YAMLs. Audio buffers kept in `MALLOC_CAP_INTERNAL` (PSRAM broke MWW). Task priority stays at 19 (12 was below lwIP and starved MWW).
- P4 UI polish: Noto Sans with `GF_Latin_Core` (fixes missing curly quotes). VA layout: image at bottom, text above. Removed the Refresh Contacts button. S3 Audio LED fixes.
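The new task-tuning options from the deep audit could be set per device roughly like this. This is a sketch: the option names come from the changelog, but the values and the surrounding keys are assumptions, not a verified schema:

```yaml
# Sketch: per-device real-time audio task tuning.
# Option names are from the changelog; values and placement are assumptions.
i2s_audio_duplex:
  id: duplex_bus
  task_priority: 19     # keep above lwIP so wake word inference isn't starved
  task_core: 0          # pin AEC work to core 0, leaving core 1 for MWW/LVGL
  task_stack_size: 8192 # bytes; raise if concurrent TTS causes stack overflows
```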
- Waveshare ESP32-P4-WiFi6-Touch-LCD-10.1 support: Full VA + MWW + Intercom on the ESP32-P4 RISC-V dual-core (32MB Flash, 32MB PSRAM) with a 10.1" MIPI DSI capacitive touch display (GT9271), ES8311 DAC + ES7210 4-ch ADC, and WiFi via an ESP32-C6 co-processor (SDIO). Ready-to-flash YAML config included (`waveshare-p4-touch-lcd-va-intercom.yaml`).
- P4 split-screen UI: Portrait 800x1280 display divided into two halves. The top is a swipeable LVGL tileview: a weather page with current conditions, MDI icons, and a 5-day forecast via the `weather.get_forecasts` action, plus an intercom page with contacts, call controls, and dynamic state groups. The bottom is a touch-to-talk Voice Assistant area with an animated avatar (20-frame idle animation), per-state images (listening, thinking, error), and mood-based replying backgrounds (happy/neutral/angry parsed from the LLM emoticon prefix). Full overlay pages for no-WiFi, no-HA, and timer states.
- Ringtone on incoming calls: Devices now play a looping ringtone sound (`sounds/ringtone.flac`) while in the ringing state. The ringtone stops automatically when the call is answered, declined, or times out.
- New actions: `intercom_api.set_contact` selects a contact by name (useful for HA automations and voice commands). `intercom_api.set_volume` and `intercom_api.set_mic_gain_db` allow programmatic control of audio levels from YAML lambdas or automations.
- Card v2.1.2: Error messages now persist across DOM rebuilds (stored in the `_errorMsg` property). `disconnectedCallback()` properly cleans up the mic, AudioContext, and WS subscriptions when the card is removed from the DOM. Auto-bridge only matches the destination against intercom devices (prevents false matches with non-intercom entities sharing the same name).
- Display logic unification: Xiaozhi and P4 YAML configs now share the same display update pattern: all intercom triggers use the `backlight_timer` script (instead of direct `light.turn_on`), a robust `on_end` waits for TTS drain before restoring the display, and `ha_active_devices.on_value` updates the display via the `draw_display` script (not direct LVGL calls). Stopping animations (`lv_anim_del`) before a page switch prevents a split-screen glitch during incoming calls.
- LVGL image format fix (P4): Assistant animation images changed from `type: RGB` to `type: RGB565` with `byte_order: little_endian` to match the display's `LV_COLOR_DEPTH=16`. RGB (24-bit) with 16-bit color depth caused LVGL to assign `LV_IMG_CF_RGB888`, which the built-in decoder cannot open, resulting in a "No data" placeholder.
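The new actions could be wired into a template button like this. A sketch only: the action names come from the changelog, but the argument names and values are assumptions:

```yaml
# Sketch: calling the new intercom_api actions from ESPHome YAML.
# Action names are from the changelog; argument names are assumptions.
button:
  - platform: template
    name: "Call Kitchen"
    on_press:
      - intercom_api.set_contact:
          name: "Kitchen"        # select contact by name, as for voice commands
      - intercom_api.set_volume:
          volume: 0.8            # assumed 0.0–1.0 scale
```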
- Waveshare ESP32-S3-AUDIO-Board support: Full VA + MWW + Intercom on the Waveshare ESP32-S3-AUDIO-Board (ES8311 DAC + ES7210 4-ch ADC). Ready-to-flash YAML config included (`waveshare-s3-audio-va-intercom.yaml`).
- TDM hardware AEC reference: New `use_tdm_reference` mode for boards with an ES7210 multi-channel ADC. The ES7210 operates in TDM mode with one slot carrying the voice mic and another carrying the DAC analog output (via MIC3). The reference is sample-aligned from the same I2S frame, so no ring buffer delay is needed. I2S uses `I2S_SLOT_MODE_STEREO` for TDM (MONO only puts slot 0 in DMA). The ES8311 reads/writes slot 0 as standard I2S.
- AEC reference volume fix: Research confirmed that both ES8311 digital feedback (stereo loopback) and ES7210 TDM analog capture provide reference signals that already include hardware DAC volume. The previous `aec_reference_volume` scaling was double-attenuating the reference in these modes, degrading echo cancellation. Now `aec_reference_volume` is applied only in ring buffer mode (raw PCM before the DAC); stereo and TDM modes apply only `mic_attenuation` for level matching.
- Robustness improvements: Ring buffer race condition fix (atomic request flags for deferred reset), AEC buffer allocation checks, task deletion UB fix (`task_exited_` atomic flag), I2S persistent error recovery (consecutive error counter), and a speaker ref buffer allocation guard for stereo/TDM modes (saves 25-32KB RAM).
- Relaxed atomics: All `std::atomic` operations use `memory_order_relaxed` (safe on the cache-coherent Xtensa ESP32-S3; eliminates unnecessary MEMW fence instructions in the audio hot loop).
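Enabling the TDM reference might look like the following. This is a sketch under stated assumptions: only `use_tdm_reference` is named above; the component key and any neighboring options are guesses:

```yaml
# Sketch: TDM hardware AEC reference on an ES7210 board.
# Only `use_tdm_reference` is documented above; placement is an assumption.
i2s_audio_duplex:
  use_tdm_reference: true  # ES7210 TDM slot carries the DAC output, sample-aligned
```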
- 48kHz I2S bus with FIR decimation: The I2S bus now runs at 48kHz (ES8311 native rate) for noticeably better TTS and media audio quality. An internal 32-tap FIR anti-alias filter (Kaiser β=8.0, ~60dB stopband attenuation, float arithmetic on the ESP32-S3 hardware FPU) decimates the mic/AEC/VA/intercom paths to 16kHz. The speaker path stays at 48kHz end-to-end: HA transcodes media via ffmpeg_proxy directly to FLAC 48kHz, and the ESPHome resampler handles any other source rate. New `output_sample_rate` config option; fully backward compatible (omitting it = no change; ratio = 1 takes a zero-overhead memcpy path).
- FreeRTOS task layout overhaul, MWW detection fully restored: The audio task (`i2s_duplex`) moved from Core 1 (priority 9) to Core 0 (priority 19), matching the canonical Espressif AEC pattern. MWW inference (unpinned, priority 3) now naturally schedules to Core 1, completely free from AEC interference. Result: 10/10 wake word detection during TTS (was 1/10). AEC CPU cost is ~42% of Core 0 per 16ms frame regardless; the fix is architectural separation, not a mode change. LVGL/display rendering on Core 1 is also no longer preempted by AEC every 16ms. Intercom task priorities aligned to canonical values (srv: 5, tx: 5, spk: 4).
- Audio reliability fixes (code audit): Several race conditions and stuck-state bugs eliminated:
  - The `ERROR` message handler now properly closes the socket, resets the FSM, and fires `on_call_failed` (was a no-op, leaving the ESP stuck in the OUTGOING state)
  - The OUTGOING timeout now calls `set_active_(false)` before `end_call_()`, stopping mic/speaker on timeout
  - `dc_offset_` IIR state is reset between call sessions (it was accumulating across sessions, causing an audio startup glitch on radio streams)
  - TOCTOU fixes: a single atomic load of `client_.socket` in the select loop and of `call_state_` in the accept condition
  - Removed a duplicate `STOP` send in `stop()` (already sent by `close_client_socket_()`)
- Code cleanup & trigger unification: Removed `client_mode_` and the connect/disconnect client-mode branch (never used in production). Unified triggers: `on_incoming_call` merged into `on_ringing`; `on_call_end` removed (covered by `on_hangup`/`on_call_failed`). Added `entity_category: config` on the auto-answer and AEC switches.
- `intercom_native` HA integration refactor: `websocket_api.py` restructured: 6 TCP session callbacks extracted from nested closures into `IntercomSession` instance methods (`_on_audio`, `_on_disconnected`, `_on_ringing`, `_on_answered`, `_on_stop_received`, `_on_error_received`); a `_create_tcp_client()` factory and a `_stop_device_sessions()` helper extracted (eliminates duplicate stop/decline logic in `websocket_stop` and `websocket_decline`). Dead code removed: the `_set_incoming_caller()` function, the `on_connected` callback from `tcp_client.py`, and unused protocol/audio constants from `const.py` (`PROTOCOL_VERSION`, `FLAG_END`, `ERR_*`, `SAMPLE_RATE`, `BITS_PER_SAMPLE`, `PING_TIMEOUT`, `EVENT_*`). Frontend cleanup: all `console.log` debug output removed from `intercom-processor.js`; the dead `_unsubscribeState` subscription removed from `intercom-card.js`. Manifest bumped to 2.0.5 with `hassfest`-compliant key ordering.
- Display & UI fixes: SPI clock 40MHz (halves GC9A01A flush time), LVGL `buffer_size` 50%, instant page transitions via `lv_disp_load_scr()`. Fixed stale VA response text persisting on screen when a media player starts later: LVGL reply labels are now explicitly cleared when `text_response` is set to empty, preventing previous conversation text from reappearing hours later.
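A 48kHz setup might be configured like this. A sketch, not a verified schema: `output_sample_rate` is the new option named above, while the other keys and values are assumptions:

```yaml
# Sketch: 48 kHz I2S bus with FIR decimation to 16 kHz capture paths.
# `output_sample_rate` is from the changelog; other keys are assumptions.
i2s_audio_duplex:
  sample_rate: 16000        # rate seen by mic/AEC/VA/intercom after decimation
  output_sample_rate: 48000 # ES8311 native rate; omit for the previous behavior
```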
### What's next: v2.2.0 and beyond

- ESP-AFE integration: Espressif's full Audio Front-End pipeline bundles beamforming, noise suppression, and echo cancellation in a single optimized block. The goal is to offer it as an alternative to the current `esp_aec` component; both will remain supported. Noise suppression would particularly benefit analog reference setups (Waveshare ES7210 TDM), where the ADC noise floor is higher than with digital feedback (ES8311 stereo).
- ESP32-P4 hardware DSP: The P4 has a dedicated audio DSP that could potentially offload AEC and noise suppression from the application cores entirely. Initial support is already shipping (v2.1.2); further optimization will explore the hardware accelerators.
- i2s_audio_duplex: mixer compatibility fix: Added `audio_output_callback_` forwarding from the I2S audio task to the duplex speaker. Without this, `platform: mixer` source speakers (`va_speaker`, `intercom_speaker`) never detect that audio has been played, staying stuck in `STATE_RUNNING` forever. This caused `media_player.is_announcing` to stay true indefinitely after TTS playback.
- i2s_audio_duplex: speaker start/stop idempotency: `start()` now uses an atomic `listener_registered_` guard with `compare_exchange_strong` to prevent multiple `xSemaphoreTake()` calls per stream session. Previously, `play()` calling `start()` before `loop()` set `STATE_RUNNING` caused the semaphore count to leak (N takes, 1 give), preventing the speaker from ever stopping.
- New: `xiaozhi-ball-v3.yaml`: Voice Assistant + Intercom + LVGL display config for the Xiaozhi Ball V3. Uses LVGL declarative widgets instead of manual C++ display lambdas: `animimg` for the idle animation, mood-based replying backgrounds (happy/neutral/angry parsed from the LLM emoticon prefix), `SCROLL_CIRCULAR` for long text, and a timer overlay on `top_layer`. Coexists with VA, MWW, AEC, and intercom on a single ESP32-S3.
- Timer alarm fix: Replaced the `REPEAT_ONE` media player mode (which caused TTS to loop instead of the timer sound due to a race condition) with an explicit `timer_alarm_loop` script. Fixed the timer sound not playing: converted `timer_finished.flac` from 48kHz to 16kHz to match the announcement pipeline sample rate.
- Display fixes: LVGL scrollbar disabled on the round screen, battery NaN guard at boot, stale text cleared between VA interactions (labels cleared in `text_sensor.on_value` handlers)
- Intercom stack overflow fix: Increased the intercom task stack from 4KB to 8KB to prevent a crash during concurrent TTS playback
- YAML reorganization: All configs renamed to descriptive names: `xiaozhi-ball-v3.yaml`, `xiaozhi-ball-v3-intercom.yaml`, `esp32-s3-mini-va-intercom.yaml`, `esp32-s3-mini-intercom.yaml`
- Voice Assistant + Intercom coexistence: Full dual-mode operation with MWW, VA, and intercom on the same ESP32-S3
- Ready-to-use YAML configs for Xiaozhi Ball V3 and ESP32-S3 Mini
- Bug fixes: `speaker_running_` data race (now `std::atomic`), inconsistent allocator in `start_speaker()`, removed dead `aec_frame_count_`
- Performance: Pre-allocated audio buffer in `duplex_microphone` (eliminates per-frame vector allocation at ~62 Hz)
- ESP32-P4 support: Added to `esp_aec` supported variants; `#ifdef USE_ESP_AEC` guards for clean builds without AEC
- Custom wake words: "Hey Bender" and "Hey Trowyayoh" models included
- Documentation overhaul: AEC best practices, ES8311 stereo L/R reference, mode selection guide, attribution headers
- AEC + MWW coexistence: Timeout gating, reference buffer reset on speaker start/stop, TTS barge-in support
- Dual mic path: `pre_aec` microphone option for raw audio to MWW while AEC-processed audio goes to VA
- Code style refactor: C++ casts, include order, format specifiers across all components
- TCP read timeout: Dead connection detection (5s streaming, 60s idle)
- ES8311 Digital Feedback AEC: Sample-accurate echo cancellation via stereo L/R split
- Bridge cleanup fix: Properly remove bridges when calls end
- Reference counting: Counting semaphore for multiple mic/speaker listeners
- MicrophoneSource pattern: Shared microphone access between components
- Full mode: ESP↔ESP calls through HA bridge
- Card as pure ESP state mirror (no internal state tracking)
- Contacts management with auto-discovery
- Persistent settings (volume, gain, AEC saved to flash)
- Initial release
- Simple mode: Browser ↔ HA ↔ ESP
- AEC support via esp_aec component
- i2s_audio_duplex for single-bus codecs
If this project was helpful and you'd like to see more useful ESPHome/Home Assistant integrations, please consider supporting my work:
Your support helps me dedicate more time to open source development. Thank you! 🙏
MIT License - See LICENSE for details.
Contributions are welcome! Please open an issue or pull request on GitHub.
Developed with the help of the ESPHome and Home Assistant communities, and Claude Code as AI pair programming assistant.














