XEye

XEye is currently among the Top 28 shortlisted startups in the Qualcomm® Vietnam Innovation Challenge (QVIC) 2026.

XEye is a wearable, on-device AI assistant for visually impaired users. It runs on the Thundercomm RUBIK Pi 3, which is built on the Qualcomm Dragonwing™ QCS6490 platform.

Technical Notes: English · Tiếng Việt (Last updated: 25/05/2026)

Pipeline

Mic ─────────→ STT ─────────→  VI question (text input)
                                        ↓
Camera ────→ Image input ────→ Vision Language Model
                                        ↓
                               VI answer (text output)
                                        ↓
                                       TTS ────→ VI answer (audio output) ────→ Speaker playback
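
The diagram above can be read as three functions chained in sequence. The sketch below is a minimal illustration of that flow, not the project's `pipeline.py`: the names `run_pipeline` and `PipelineResult` are hypothetical, and each stage is passed in as a plain callable so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    question: str  # Vietnamese text produced by STT
    answer: str    # Vietnamese text produced by the VLM
    audio: bytes   # audio bytes produced by TTS


def run_pipeline(
    wav_bytes: bytes,
    image_bytes: bytes,
    stt: Callable[[bytes], str],
    vlm: Callable[[bytes, str], str],
    tts: Callable[[str], bytes],
) -> PipelineResult:
    """Chain the three stages in the order shown in the diagram."""
    question = stt(wav_bytes)            # Mic -> STT -> question text
    answer = vlm(image_bytes, question)  # Image + question -> answer text
    audio = tts(answer)                  # Answer text -> audio for playback
    return PipelineResult(question, answer, audio)


if __name__ == "__main__":
    # Stub stages stand in for the real models.
    result = run_pipeline(
        b"wav",
        b"jpg",
        stt=lambda wav: "Đây là gì?",
        vlm=lambda img, q: f"answer to: {q}",
        tts=lambda text: text.encode("utf-8"),
    )
    print(result.question, "->", result.answer)
```

Because the stages are sequential, the total latency is roughly the sum of the three stage latencies.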

Hardware


Board	Thundercomm RUBIK Pi 3
SoC	Qualcomm QCS6490
CPU	4× Cortex-A55 @1.96GHz + 3× Cortex-A78 @2.40GHz + 1× Cortex-X1 @2.71GHz
RAM	8GB LPDDR4x
NPU	Hexagon 780 (V73) — 12 TOPS
GPU	Adreno 643
OS	Ubuntu (Linux 6.8.0-1071-qcom)

Models

Module	Model	Runtime	Quantization	Performance
STT	ZipFormer-30M RNNT	sherpa-onnx	int8	RTF <0.1x
VLM	Vintern-1B-v3_5	llama-cpp-python	Q4_K_M	~2.8-3.0 tok/s
TTS	VieNeu-TTS-v2-Turbo	llama-cpp-python + VieNeu-Codec ONNX	Q4_K_M	~4-5s latency
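
For readers unfamiliar with the two metrics in the Performance column: the real-time factor (RTF) is processing time divided by audio duration, so values below 1 are faster than real time, and tok/s is generated tokens divided by generation time. The helpers below are a small sketch with hypothetical names. The 3-second clip length is an assumed value for illustration only; the token count and timing are taken from the Demo log.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


def tokens_per_second(tokens: int, elapsed_s: float) -> float:
    """Throughput = generated tokens / generation time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return tokens / elapsed_s


if __name__ == "__main__":
    # Assumed clip length of 3.0 s (hypothetical), transcribed in 0.13 s.
    print(f"RTF   = {real_time_factor(0.13, 3.0):.3f}")
    # 29 tokens in 16.68 s, as reported in the Demo log.
    print(f"tok/s = {tokens_per_second(29, 16.68):.1f}")
```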

Why CPU-only?
The QCS6490's Hexagon NPU (12 TOPS) is designed for computer-vision inference such as object detection and classification, not for LLM/VLM workloads. It does not efficiently support the attention mechanisms, dynamic KV cache, or large matrix multiplications that language models require. The Adreno GPU shares system RAM with the CPU, which makes it a poor fit for models that need several GB of memory. After extensive testing of several approaches (the llama.cpp Hexagon backend, the ONNX Runtime QNN EP, and Qualcomm AI Hub), CPU inference with heavily quantized models was the only viable path on this hardware.
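
As a rough illustration of what CPU-only means for a llama-cpp-python model, the sketch below builds constructor arguments that keep every layer on the CPU. It is not taken from `server.py`. The helper name, the default of 4 threads (chosen here to match the four faster cores listed in the Hardware table), and the context size of 2048 are all assumptions, and the multimodal projector wiring needed for the VLM is omitted.

```python
import os
from typing import Optional


def cpu_only_llama_kwargs(
    model_path: str,
    n_threads: Optional[int] = None,
    n_ctx: int = 2048,  # assumed context size, not the project's setting
) -> dict:
    """Build keyword arguments for llama_cpp.Llama with no GPU offload."""
    if n_threads is None:
        # Assumption: cap at 4 threads, falling back to 1 if the count is unknown.
        n_threads = min(4, os.cpu_count() or 1)
    return {
        "model_path": model_path,
        "n_gpu_layers": 0,  # 0 = keep all layers on the CPU
        "n_threads": n_threads,
        "n_ctx": n_ctx,
        "verbose": False,
    }


# Usage (requires llama-cpp-python and a downloaded GGUF file;
# the file name below is a placeholder):
#   from llama_cpp import Llama
#   llm = Llama(**cpu_only_llama_kwargs("models/vintern/<model>.gguf"))
```

The best thread count depends on the workload and is worth measuring on the board rather than assuming.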

Setup

# 1. Create environment and install dependencies
bash setup.sh

# 2. Activate environment
conda activate xeye

# 3. Download all models (~1.3GB total)
python download_models.py

# 4. Start server
python server.py

Testing AI Services

The server must be running (python server.py) before you run any of the scripts below.
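
Since the models take some time to load, it can help to wait for the `/health` endpoint before running a script. The following is a minimal standard-library sketch; the function names, the retry count, and the delay are assumptions rather than part of the repository.

```python
import time
import urllib.request


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses.
        return False


def wait_until(probe, attempts: int = 30, delay_s: float = 1.0, sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(delay_s)
    return False


if __name__ == "__main__":
    ready = wait_until(is_healthy, attempts=30, delay_s=1.0)
    print("server ready" if ready else "server did not come up")
```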

STT

python scripts/stt_infer.py --file data/audio/question.wav

VLM

# Describe image (no question — uses default prompt)
python scripts/vlm_infer.py data/images/IMG_6817.jpg

# Ask a Vietnamese question
python scripts/vlm_infer.py data/images/IMG_6817.jpg --question "Đây là gì?"

TTS

# Synthesize to file
python scripts/tts_infer.py "Xin chào" --out data/audio/output.wav

# Synthesize with a different voice
python scripts/tts_infer.py "Xin chào" --out data/audio/output.wav --voice "Phạm Tuyên (Nam - Miền Bắc)"

API Server

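The server listens on all network interfaces at port 8000 (http://0.0.0.0:8000). On the device itself, use http://localhost:8000.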

Endpoint	Method	Input	Output
`/health`	GET	—	`{"status": "ok"}`
`/stt`	POST	WAV file (multipart)	`{"text": "..."}`
`/vlm`	POST	`image` (file), `question` (Vietnamese, optional)	`{"vi": "...", "tokens": N, "elapsed_s": N, "tok_s": N}`
`/tts`	POST	`text` (Vietnamese), `voice` (name, optional)	WAV audio bytes
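
A client may want to validate the `/vlm` JSON before using it. The sketch below parses the four keys listed in the table into a small dataclass. The class and function names are hypothetical, and the sample values in the usage block are illustrative rather than real server output.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class VlmResponse:
    vi: str           # Vietnamese answer text
    tokens: int       # number of generated tokens
    elapsed_s: float  # generation time in seconds
    tok_s: float      # tokens per second


def parse_vlm_response(body: str) -> VlmResponse:
    """Parse a /vlm JSON body, failing loudly if a documented key is missing."""
    data = json.loads(body)
    missing = {"vi", "tokens", "elapsed_s", "tok_s"} - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return VlmResponse(
        vi=str(data["vi"]),
        tokens=int(data["tokens"]),
        elapsed_s=float(data["elapsed_s"]),
        tok_s=float(data["tok_s"]),
    )


if __name__ == "__main__":
    sample = '{"vi": "Xin chào", "tokens": 3, "elapsed_s": 1.5, "tok_s": 2.0}'
    print(parse_vlm_response(sample))
```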

Example calls

# Health check
curl http://localhost:8000/health

# STT
curl -X POST http://localhost:8000/stt \
  -F "audio=@data/audio/question.wav"

# VLM
curl -X POST http://localhost:8000/vlm \
  -F "image=@data/images/IMG_6817.jpg" \
  -F "question=Đây là gì?"

# TTS
curl -X POST http://localhost:8000/tts \
  -F "text=Xin chào" \
  -F "voice=Bích Ngọc (Nữ - Miền Bắc)" \
  --output response.wav
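
The same `/vlm` call can be made from Python using only the standard library. The sketch below hand-builds the multipart body that `curl -F` produces. The helper names, the fixed `image.jpg` filename, and the 120-second timeout are assumptions; the field names `image` and `question` come from the endpoint table.

```python
import json
import urllib.request
import uuid


def encode_multipart(fields, files, boundary=None):
    """fields: name -> str; files: name -> (filename, bytes, content_type).

    Returns (body_bytes, content_type_header).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    for name, (filename, data, ctype) in files.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            f"Content-Type: {ctype}".encode(),
            b"",
            data,
        ]
    parts += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def ask_vlm(image_path, question=None, base_url="http://localhost:8000"):
    """POST an image (and optional Vietnamese question) to /vlm; return the JSON dict."""
    with open(image_path, "rb") as f:
        image = f.read()
    fields = {"question": question} if question else {}
    body, content_type = encode_multipart(
        fields, {"image": ("image.jpg", image, "image/jpeg")}
    )
    req = urllib.request.Request(
        f"{base_url}/vlm",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Generous timeout (assumed): VLM answers take many seconds on this board.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(ask_vlm("data/images/IMG_6817.jpg", "Đây là gì?"))
```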

TTS Voices

Name	Gender	Dialect
`Bích Ngọc (Nữ - Miền Bắc)`	Female	Northern (default)
`Phạm Tuyên (Nam - Miền Bắc)`	Male	Northern
`Thục Đoan (Nữ - Miền Nam)`	Female	Southern
`Xuân Vĩnh (Nam - Miền Nam)`	Male	Southern
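
Because the `voice` parameter expects the full display name, including diacritics, a small lookup table can reduce typing mistakes in client code. This is a convenience sketch with hypothetical names (`VOICES`, `pick_voice`); the voice strings are copied from the table above.

```python
VOICES = {
    ("female", "northern"): "Bích Ngọc (Nữ - Miền Bắc)",
    ("male", "northern"): "Phạm Tuyên (Nam - Miền Bắc)",
    ("female", "southern"): "Thục Đoan (Nữ - Miền Nam)",
    ("male", "southern"): "Xuân Vĩnh (Nam - Miền Nam)",
}

# The table marks the female Northern voice as the default.
DEFAULT_VOICE = VOICES[("female", "northern")]


def pick_voice(gender=None, dialect=None):
    """Return the voice name for a gender/dialect pair.

    Missing arguments fall back to the default voice's gender or dialect.
    """
    key = ((gender or "female").lower(), (dialect or "northern").lower())
    try:
        return VOICES[key]
    except KeyError:
        raise ValueError(f"no voice for gender={gender!r}, dialect={dialect!r}") from None


if __name__ == "__main__":
    print(pick_voice())                    # default voice
    print(pick_voice("male", "southern"))  # male Southern voice
```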

Demo

python pipeline.py demo/audio/question.wav --image demo/images/IMG_6817.jpg

Input image: demo/images/IMG_6817.jpg

1. STT - Audio input:

question.mp4

[STT] Transcribing ...
[STT] 'mô tả khung cảnh trước mặt tôi'  (0.13s)
(English: "describe the scene in front of me")

2. VLM - Image analysis:

[VLM] Analyzing image ...
[VLM] Đây là một phòng họp với một người đàn ông đang ngồi trước màn hình máy tính. Trên bàn có một máy tính xách tay
[VLM] 29 tokens | 16.68s | 1.7 tok/s  (16.78s total)
(English: "This is a meeting room with a man sitting in front of a computer screen. On the table there is a laptop")

3. TTS - Audio output:

[TTS] Synthesizing ...
[TTS] Saved → data/audio/output.wav  (4.46s)

output.mp4

[pipeline] Total: 21.38s
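
To see where the time goes in this run, the per-stage totals from the log can be expressed as shares of the overall wall time. The helper name below is hypothetical; the numbers are the ones printed in the log above.

```python
def stage_shares(stages: dict, total_s: float) -> dict:
    """Return each stage's share of the total wall time, in percent."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    return {name: round(100 * secs / total_s, 1) for name, secs in stages.items()}


# Stage totals as printed in the demo log.
demo = {"stt": 0.13, "vlm": 16.78, "tts": 4.46}
shares = stage_shares(demo, total_s=21.38)

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name}: {pct}%")
    print("slowest stage:", max(shares, key=shares.get))
```

In this run the VLM stage accounts for most of the latency, so VLM throughput is the main lever for reducing end-to-end response time.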

Run as Service (PM2)

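# Run these with the xeye conda environment active so that $(which python) resolves to its interpreter
npm install -g pm2
pm2 start server.py --interpreter "$(which python)" --name xeye --cwd /home/ubuntu/xeye
# Persist the current process list
pm2 save
# Print the command that enables start on boot, then run the command it prints
pm2 startup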

Project Structure

xeye
├─ demo
│  ├─ audio/                    # Demo WAV files
│  ├─ images/                   # Demo images
│  └─ video/                    # Generated MP4s for README
├─ docs
│  ├─ bao_cao_ky_thuat.md       # Technical report (Vietnamese)
│  └─ technical_report.md       # Technical report (English)
├─ models
│  ├─ vintern/                  # Vintern-1B-v3_5 GGUF + mmproj
│  ├─ stt/                      # ZipFormer RNNT int8 ONNX
│  └─ tts/                      # VieNeu-TTS-v2-Turbo GGUF + voices
├─ scripts
│  ├─ stt_infer.py              # STT test script
│  ├─ vlm_infer.py              # VLM test script
│  └─ tts_infer.py              # TTS test script
├─ server.py                    # FastAPI server (STT / VLM / TTS endpoints)
├─ pipeline.py                  # Demo pipeline (WAV + image → text + audio)
├─ download_models.py           # Download all models from HuggingFace
├─ setup.sh                     # Create conda env + install deps
└─ requirements.txt
