XEye is currently among the Top 28 shortlisted startups of Qualcomm® Vietnam Innovation Challenge (QVIC) 2026.
XEye is a wearable, on-device AI assistant for the visually impaired, built on the Qualcomm Dragonwing™ QCS6490 platform (Thundercomm RUBIK Pi 3).
Technical Notes: English · Tiếng Việt (Last updated: 25/05/2026)
Mic ─────────→ STT ─────────→ VI question (text input)
↓
Camera ────→ Image input ────→ Vision Language Model
↓
VI answer (text output)
↓
TTS ────→ VI answer (audio output) ────→ Speaker playback
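The flow in the diagram can be sketched as three chained stages. This is a minimal illustration, not the project's actual `pipeline.py`: the `run_pipeline` function, the `PipelineResult` type, and the stub stages are hypothetical names introduced here so the sketch runs without any model installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    question: str  # Vietnamese question text produced by STT
    answer: str    # Vietnamese answer text produced by the VLM
    audio: bytes   # synthesized speech, ready for speaker playback


def run_pipeline(
    mic_audio: bytes,
    camera_image: bytes,
    stt: Callable[[bytes], str],
    vlm: Callable[[bytes, str], str],
    tts: Callable[[str], bytes],
) -> PipelineResult:
    """Chain the stages in the order shown in the diagram."""
    question = stt(mic_audio)             # Mic -> STT -> question text
    answer = vlm(camera_image, question)  # image + question -> answer text
    audio = tts(answer)                   # answer text -> audio output
    return PipelineResult(question, answer, audio)


# Stub stages (assumptions) stand in for the real models.
result = run_pipeline(
    b"wav-bytes",
    b"jpg-bytes",
    stt=lambda audio: "Đây là gì?",
    vlm=lambda image, question: f"answer to: {question}",
    tts=lambda text: text.encode("utf-8"),
)
print(result.answer)
```

Passing the stages in as callables keeps the orchestration independent of the runtimes, so each model can be swapped or tested on its own.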
| Component | Specification |
|---|---|
| Board | Thundercomm RUBIK Pi 3 |
| SoC | Qualcomm QCS6490 |
| CPU | 4× Cortex-A55 @1.96GHz + 3× Cortex-A78 @2.40GHz + 1× Cortex-X1 @2.71GHz |
| RAM | 8GB LPDDR4x |
| NPU | Hexagon 780 (V73), 12 TOPS |
| GPU | Adreno 643 |
| OS | Ubuntu (Linux 6.8.0-1071-qcom) |
| Module | Model | Runtime | Quantization | Performance |
|---|---|---|---|---|
| STT | ZipFormer-30M RNNT | sherpa-onnx | int8 | RTF <0.1x |
| VLM | Vintern-1B-v3_5 | llama-cpp-python | Q4_K_M | ~2.8-3.0 tok/s |
| TTS | VieNeu-TTS-v2-Turbo | llama-cpp-python + VieNeu-Codec ONNX | Q4_K_M | ~4-5s latency |
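The Performance column uses two metrics: real-time factor (RTF) for STT and tokens per second for the VLM. A small sketch of how these are usually computed follows; the function names and the sample numbers are illustrative assumptions, not measurements from this project.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def tokens_per_second(tokens: int, elapsed_s: float) -> float:
    """Generation throughput: tokens produced divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens / elapsed_s


# Hypothetical inputs: a 3.0 s clip transcribed in 0.13 s,
# and 28 tokens generated in 10.0 s.
rtf = real_time_factor(0.13, 3.0)
tps = tokens_per_second(28, 10.0)
print(f"RTF {rtf:.3f}, {tps:.1f} tok/s")
```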
Why CPU-only?
The QCS6490's Hexagon NPU (12 TOPS) is designed for computer vision inference (object detection, classification), not LLM/VLM workloads. It does not efficiently support the attention mechanisms, dynamic KV cache, or large matrix multiplications that language models require. The Adreno GPU shares system RAM, which makes it unsuitable for models that require several GB of memory. After testing several approaches (llama.cpp Hexagon backend, ONNX Runtime QNN EP, Qualcomm AI Hub), CPU inference with heavily quantized models was the only viable path on this hardware.
# 1. Create environment and install dependencies
bash setup.sh
# 2. Activate environment
conda activate xeye
# 3. Download all models (~1.3GB total)
python download_models.py
# 4. Start server
python server.py

The server must be running (python server.py) before using any script.

python scripts/stt_infer.py --file data/audio/question.wav
# Describe image (no question, uses default prompt)
python scripts/vlm_infer.py data/images/IMG_6817.jpg
# Ask a Vietnamese question
python scripts/vlm_infer.py data/images/IMG_6817.jpg --question "Đây là gì?"
# Synthesize to file
python scripts/tts_infer.py "Xin chào" --out data/audio/output.wav
# Synthesize with a different voice
python scripts/tts_infer.py "Xin chào" --out data/audio/output.wav --voice "Phạm Tuyên (Nam - Miền Bắc)"

The server runs at http://0.0.0.0:8000
| Endpoint | Method | Input | Output |
|---|---|---|---|
| `/health` | GET | - | `{"status": "ok"}` |
| `/stt` | POST | WAV file (multipart) | `{"text": "..."}` |
| `/vlm` | POST | image (file), question (Vietnamese, optional) | `{"vi": "...", "tokens": N, "elapsed_s": N, "tok_s": N}` |
| `/tts` | POST | text (Vietnamese), voice (name, optional) | WAV audio bytes |
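The endpoints in the table can also be called from Python. The following is a minimal standard-library sketch, assuming the form field names (`audio`, `image`, `question`, `text`, `voice`) and the `localhost:8000` address used by the curl commands in this README. The helper names (`encode_multipart`, `post`, `transcribe`, `ask`, `speak`) are hypothetical and not part of the project.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumed: same host/port as the curl examples


def encode_multipart(fields: dict, files: dict) -> tuple:
    """Encode text fields and file paths as multipart/form-data, like `curl -F`."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    for name, path in files.items():
        path = Path(path)
        ctype = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{path.name}"\r\nContent-Type: {ctype}\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + path.read_bytes() + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def post(endpoint: str, fields: Optional[dict] = None, files: Optional[dict] = None) -> bytes:
    """POST a multipart form to the server and return the raw response body."""
    body, content_type = encode_multipart(fields or {}, files or {})
    request = urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read()


def transcribe(wav_path: str) -> str:
    """Send a WAV file to /stt and return the transcribed text."""
    return json.loads(post("/stt", files={"audio": wav_path}))["text"]


def ask(image_path: str, question: Optional[str] = None) -> str:
    """Send an image (and optional question) to /vlm and return the answer."""
    fields = {"question": question} if question else {}
    return json.loads(post("/vlm", fields, {"image": image_path}))["vi"]


def speak(text: str, out_path: str, voice: Optional[str] = None) -> None:
    """Send text to /tts and write the returned WAV bytes to out_path."""
    fields = {"text": text}
    if voice:
        fields["voice"] = voice
    Path(out_path).write_bytes(post("/tts", fields))


# Usage, with the server running:
#   question = transcribe("data/audio/question.wav")
#   answer = ask("data/images/IMG_6817.jpg", question)
#   speak(answer, "response.wav")
```

The generous 120 s timeout is a guess based on the multi-second VLM latency reported in this README; adjust it to the observed response times.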
# Health check
curl http://localhost:8000/health
# STT
curl -X POST http://localhost:8000/stt \
-F "audio=@data/audio/question.wav"
# VLM
curl -X POST http://localhost:8000/vlm \
-F "image=@data/images/IMG_6817.jpg" \
-F "question=Đây là gì?"
# TTS
curl -X POST http://localhost:8000/tts \
-F "text=Xin chào" \
-F "voice=Bích Ngọc (Nữ - Miền Bắc)" \
--output response.wav

| Name | Gender | Dialect |
|---|---|---|
| Bích Ngọc (Nữ - Miền Bắc) | Female | Northern (default) |
| Phạm Tuyên (Nam - Miền Bắc) | Male | Northern |
| Thục Đoan (Nữ - Miền Nam) | Female | Southern |
| Xuân Vĩnh (Nam - Miền Nam) | Male | Southern |
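Because the `voice` parameter takes the full display name, a client may find it convenient to pick a voice by gender and dialect instead. The lookup below is a sketch built only from the table above; `VOICES`, `DEFAULT_VOICE`, and `pick_voice` are hypothetical names, and how the server treats an unknown voice name is not specified here.

```python
from typing import Optional

# Voice names copied from the table above: name -> (gender, dialect)
VOICES = {
    "Bích Ngọc (Nữ - Miền Bắc)": ("Female", "Northern"),
    "Phạm Tuyên (Nam - Miền Bắc)": ("Male", "Northern"),
    "Thục Đoan (Nữ - Miền Nam)": ("Female", "Southern"),
    "Xuân Vĩnh (Nam - Miền Nam)": ("Male", "Southern"),
}
DEFAULT_VOICE = "Bích Ngọc (Nữ - Miền Bắc)"  # marked as default in the table


def pick_voice(gender: Optional[str] = None, dialect: Optional[str] = None) -> str:
    """Return the first voice matching the filters, or the default if none are given."""
    if gender is None and dialect is None:
        return DEFAULT_VOICE
    for name, (voice_gender, voice_dialect) in VOICES.items():
        if gender in (None, voice_gender) and dialect in (None, voice_dialect):
            return name
    raise ValueError(f"no voice for gender={gender!r}, dialect={dialect!r}")


print(pick_voice("Male", "Southern"))
```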
python pipeline.py demo/audio/question.wav --image demo/images/IMG_6817.jpg

Input image: demo/images/IMG_6817.jpg
1. STT - Audio input:
question.mp4
[STT] Transcribing ...
[STT] 'mô tả khung cảnh trước mặt tôi' (0.13s)
2. VLM - image analysis:
[VLM] Analyzing image ...
[VLM] Đây là một phòng họp với một người đàn ông đang ngồi trước màn hình máy tính. Trên bàn có một máy tính xách tay
[VLM] 29 tokens | 16.68s | 1.7 tok/s (16.78s total)
3. TTS - audio output:
[TTS] Synthesizing ...
[TTS] Saved → data/audio/output.wav (4.46s)
output.mp4
[pipeline] Total: 21.38s
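The timings in this log show where the end-to-end latency goes. The short calculation below uses only the per-stage totals printed above (0.13 s, 16.78 s, 4.46 s) and the reported pipeline total; the variable names are illustrative.

```python
# Stage timings copied from the demo log above (seconds).
stage_seconds = {"STT": 0.13, "VLM": 16.78, "TTS": 4.46}
reported_total = 21.38  # "[pipeline] Total" line

measured = sum(stage_seconds.values())
shares = {name: secs / measured for name, secs in stage_seconds.items()}
bottleneck = max(stage_seconds, key=stage_seconds.get)

for name, secs in stage_seconds.items():
    print(f"{name}: {secs:6.2f}s  {shares[name]:6.1%}")
print(f"sum of stages: {measured:.2f}s (log reports {reported_total:.2f}s)")
print(f"bottleneck: {bottleneck}")
```

In this run the VLM stage accounts for more than three quarters of the total, so speeding up generation would reduce latency far more than tuning STT or TTS.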
npm install -g pm2
pm2 start server.py --interpreter $(which python) --name xeye --cwd /home/ubuntu/xeye
pm2 save
pm2 startup

xeye
├─ demo
│ ├─ audio/ # Demo WAV files
│ ├─ images/ # Demo images
│ └─ video/ # Generated MP4s for README
├─ docs
│ ├─ bao_cao_ky_thuat.md # Technical report (Vietnamese)
│ └─ technical_report.md # Technical report (English)
├─ models
│ ├─ vintern/ # Vintern-1B-v3_5 GGUF + mmproj
│ ├─ stt/ # ZipFormer RNNT int8 ONNX
│ └─ tts/ # VieNeu-TTS-v2-Turbo GGUF + voices
├─ scripts
│ ├─ stt_infer.py # STT test script
│ ├─ vlm_infer.py # VLM test script
│ └─ tts_infer.py # TTS test script
├─ server.py # FastAPI server (STT / VLM / TTS endpoints)
├─ pipeline.py # Demo pipeline (WAV + image → text + audio)
├─ download_models.py # Download all models from HuggingFace
├─ setup.sh # Create conda env + install deps
└─ requirements.txt
