Jermalk/stormvino

Mentioned in Awesome OpenVINO

Stormvino

OpenAI-compatible LLM server for Intel Arc GPUs. Runs local inference via OpenVINO. Speaks the OpenAI API — drop it behind any client that accepts a base_url. No NVIDIA required.


Hardware compatibility

| GPU | VRAM | Status | Notes |
|---|---|---|---|
| Arc B60 | 24 GB | ✅ Production | EnvyStorm reference machine |
| Arc B50 | 16 GB | 🔜 Testing | TinyB (install in progress) |
| Arc B65 | TBD | 🔜 Planned | Next after B50 confirmed |
| Arc B70 | TBD | 🔜 Planned | |
| Other Arc | any | ⚙️ Auto-tuned | VRAM detected at runtime |

Detecting B-series cards: Battlemage GPUs often report as Intel(R) Graphics [0xExxx] (e.g. [0xe212]) — not the word "Arc"; lspci and the OpenVINO device name both omit it. Identify the discrete GPU by its OpenVINO device type (DISCRETE vs INTEGRATED), not by matching "Arc". If a detection step reports "no Arc GPU found" on a B-series card, the card is still fine — confirm with clinfo or python -c "import openvino as ov; print(ov.Core().available_devices)" and continue.
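
The device-type check described above can be sketched in Python. This is a minimal illustration, not the installer's actual logic: `pick_discrete_gpu` and `query_device_types` are hypothetical helper names, and the sketch assumes that the string form of OpenVINO's `DEVICE_TYPE` property contains the word DISCRETE or INTEGRATED.

```python
from typing import Dict, Optional


def pick_discrete_gpu(device_types: Dict[str, str]) -> Optional[str]:
    """Return the first GPU device whose reported type is DISCRETE, else None."""
    for name in sorted(device_types):
        if name.startswith("GPU") and "DISCRETE" in device_types[name].upper():
            return name
    return None


def query_device_types() -> Dict[str, str]:
    """Ask OpenVINO for the DEVICE_TYPE of every GPU device it can see."""
    import openvino as ov  # imported lazily so pick_discrete_gpu works without it

    core = ov.Core()
    types = {}
    for name in core.available_devices:
        if name.startswith("GPU"):
            # Assumption: str() of the property value names the type.
            types[name] = str(core.get_property(name, "DEVICE_TYPE"))
    return types
```

On a machine with OpenVINO installed, `pick_discrete_gpu(query_device_types())` would return a name such as `GPU.1`, regardless of whether the device name contains "Arc".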

OS: Linux Mint 22.x / Ubuntu 24.04 (Noble).

Kernel: Battlemage (B-series) needs the xe driver. linux-oem-24.04 provides it, but a newer generic/mainline kernel (6.11+) that already loads xe and creates a /dev/dri/renderD* node for the card works too. The installer checks whether the GPU is already live and upgrades the kernel only if it isn't, so a working newer kernel won't be downgraded.

System RAM: 16 GB minimum (a 16 GB machine reports ~15 GiB usable).

Disk: 50 GB+ for a useful model set.
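
To see by hand whether the xe driver already owns a render node, one option is to read the driver symlinks under sysfs. The sketch below is an assumption about how such a check could look, not a copy of the installer's check. `render_nodes_using` is a hypothetical helper, and it assumes the usual Linux layout in which `/sys/class/drm/renderD*/device/driver` links to the bound kernel driver.

```python
import glob
import os
from typing import List


def render_nodes_using(driver: str, sys_drm: str = "/sys/class/drm") -> List[str]:
    """List renderD* nodes whose bound kernel driver is named `driver`."""
    matches = []
    for node in sorted(glob.glob(os.path.join(sys_drm, "renderD*"))):
        link = os.path.join(node, "device", "driver")
        if os.path.islink(link):
            # The symlink target's last path component is the driver name.
            bound = os.path.basename(os.path.realpath(link))
            if bound == driver:
                matches.append(os.path.basename(node))
    return matches
```

A non-empty result from `render_nodes_using("xe")` suggests the running kernel already drives the card, which is the situation in which a kernel upgrade should not be needed.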


Install paths — pick one

🤖 Claude Code (recommended for single machine)

Fully automated. Claude Code (CC) asks 3 questions, then handles everything, including a kernel upgrade + reboot only if your GPU isn't already working. You watch.

Step 1 — Install Claude Code if you haven't:

npm install -g @anthropic-ai/claude-code

Prerequisite — passwordless sudo for the install. The automated path runs system commands via sudo, and Claude Code's non-interactive shell can't answer a password prompt. Grant a temporary drop-in and remove it when the install finishes:

echo "$USER ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/stormvino-install
sudo chmod 0440 /etc/sudoers.d/stormvino-install
# when the install is done:  sudo rm /etc/sudoers.d/stormvino-install

Step 2 — Clone the repo into your home dir and start CC there. Don't clone into /opt — it's root-owned, so the clone fails; the runbook creates and owns /opt/ov_server for you during install:

git clone https://github.com/Jermalk/stormvino.git ~/stormvino
cd ~/stormvino
claude

Step 3 — In the CC chat, type exactly:

Run the Stormvino installation runbook. @CC_INSTALL.md

The @CC_INSTALL.md mention loads the runbook directly — no file dragging needed. CC reads it and takes over. Answer the 3 questions it asks, then watch.

→ See CC_INSTALL.md for what CC does at each phase.

⚙️ Ansible (recommended for multiple machines / repeatable deploys)

One command installs on any number of Arc machines simultaneously. Detects GPU VRAM at runtime and tunes config automatically. Fully headless — handles reboots without human intervention.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
# edit vars/main.yml (3 lines) — then:
ansible-playbook -i hosts.yml stormvino.yml

→ See ANSIBLE.md for the full plan and current implementation status.

📖 Manual (full control, learn every step)

Step-by-step guide with a verification test between every phase. Covers kernel, drivers, Python env, PostgreSQL, models, and systemd services.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
./install.sh    # detects hardware, routes to the right path

→ See INSTALL.md.


What you get

| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | OpenAI-compatible chat, streaming supported |
| POST /v1/embeddings | Sentence embeddings (multilingual-e5-large) |
| GET /v1/models | List discovered models |
| POST /v1/images/generations | Image generation (SDXL, optional) |
| POST /v1/audio/transcriptions | Speech-to-text (Whisper, optional) |
| POST /v1/audio/speech | Text-to-speech (Kokoro / Piper, optional) |
| GET /health | Server health + loaded models + VRAM stats |
| GET /monitor | Web dashboard: live VRAM, throughput, request log |

Default port: 11435. Accessible over LAN. Runs as an unprivileged stormvino systemd service (not root); the embedding model is offloaded to the iGPU when present, leaving the Arc's full VRAM for the LLM.
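
Since the chat endpoint supports streaming, a client may want to reassemble text from the event stream. The sketch below assumes the server emits the same server-sent event framing as the OpenAI API (`data: {...}` lines ending with `data: [DONE]`). That framing is inferred from the "OpenAI-compatible" claim and has not been checked against Stormvino's code. `iter_stream_text` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Feeding it the decoded lines of a streaming response and joining the fragments gives the full reply text.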

Tested models (B60 / 24 GB VRAM)

| Model | VRAM | Role |
|---|---|---|
| qwen3-14b-int4-ov | 9.1 GB | Default: reasoning, coding, chat |
| qwen3-8b-int4-ov | 4.6 GB | Agent turns, fast responses |
| multilingual-e5-large-int8 | 563 MB | Embeddings + task routing |
| whisper-large-v3-int8-ov | ~2 GB | Speech-to-text |
| qwen2.5-vl-7b-int4-ov | ~5 GB | Vision: image understanding |

→ See MODELS.md for conversion instructions and VRAM budget tables.
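
As a rough illustration of VRAM budgeting, the sketch below adds the model footprints from the table to one KV cache per loaded model and compares the total with the card's VRAM. The KV cache size and the 1 GB headroom are made-up example values, and whether the table's VRAM figures already include KV cache is not stated here, so treat the result as an upper-bound estimate. MODELS.md holds the real budget tables.

```python
from typing import Dict, Iterable

# Model footprints in GB, copied from the tested-models table.
MODEL_VRAM_GB: Dict[str, float] = {
    "qwen3-14b-int4-ov": 9.1,
    "qwen3-8b-int4-ov": 4.6,
}


def vram_needed_gb(models: Iterable[str], kv_cache_size_gb: float) -> float:
    """Model weights plus one KV cache per loaded model, in GB."""
    names = list(models)
    return sum(MODEL_VRAM_GB[m] for m in names) + kv_cache_size_gb * len(names)


def fits(models: Iterable[str], kv_cache_size_gb: float,
         vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the estimate plus a safety margin stays within the card's VRAM."""
    return vram_needed_gb(models, kv_cache_size_gb) + headroom_gb <= vram_gb
```

With a hypothetical 2 GB KV cache per model, both Qwen3 models together need about 17.7 GB, which fits a 24 GB B60 but not a 16 GB card. The 14B model alone needs about 11.1 GB.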


Quick health check

curl -s http://localhost:11435/health | python3 -m json.tool
curl -s http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-8b-int4-ov","messages":[{"role":"user","content":"Hello"}]}'
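
The same chat request can be issued from Python with only the standard library. This is a minimal sketch mirroring the curl call above. `build_chat_request` and `chat` are hypothetical helper names, and the response shape (`choices[0].message.content`) is assumed from the OpenAI API rather than verified against the server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11435/v1"  # default port from this README


def build_chat_request(prompt: str, model: str = "qwen3-8b-int4-ov",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` against a running server should return the model's reply. Any OpenAI client library pointed at the same base URL would be an alternative.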

Libraries stack

Inference (server runtime)

| Library | Version |
|---|---|
| openvino | 2026.1.0 |
| openvino-genai | 2026.1.0.0 |
| openvino-tokenizers | 2026.1.0.0 |
| infergate | 0.2.0 |
| optimum-intel | 1.27.0 |
| optimum | 2.1.0 |
| transformers | 4.57.6 |
| tokenizers | 0.22.2 |

Model conversion (offline, via optimum-cli)

| Library | Version |
|---|---|
| nncf | 3.1.0 |
| onnx | 1.21.0 |
| onnxruntime | 1.25.0 |
| safetensors | 0.7.0 |
| huggingface_hub | 0.36.2 |

Configuration

Runtime settings live in config.json. The installers auto-patch the following key settings based on detected GPU VRAM:

| Key | Description |
|---|---|
| device | OpenVINO device, auto-detected (e.g. GPU.1) |
| kv_cache_size_gb | KV cache per model, tuned to VRAM tier |
| max_loaded_models | Models held in VRAM simultaneously |
| default_model | Model used when client doesn't specify |
| embedding_model | Embedding model directory name |
| postgres_dsn | Observability database connection string |

Full reference: INSTALL.md § Phase 7.
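
For orientation, a config.json built from the keys above might be sanity-checked like this. The key names come from the table, but every value in `EXAMPLE` and every expected type in `REQUIRED_KEYS` is an illustrative assumption, and `check_config` is a hypothetical helper rather than part of Stormvino. INSTALL.md remains the authoritative reference.

```python
import json

# Expected type per key: assumed for illustration, not taken from the server code.
REQUIRED_KEYS = {
    "device": str,
    "kv_cache_size_gb": (int, float),
    "max_loaded_models": int,
    "default_model": str,
    "embedding_model": str,
    "postgres_dsn": str,
}

# Hypothetical values; the installers choose the real ones per machine.
EXAMPLE = """
{
  "device": "GPU.1",
  "kv_cache_size_gb": 2,
  "max_loaded_models": 2,
  "default_model": "qwen3-14b-int4-ov",
  "embedding_model": "multilingual-e5-large-int8",
  "postgres_dsn": "postgresql://user:password@localhost:5432/stormvino"
}
"""


def check_config(text: str) -> dict:
    """Parse config JSON and raise ValueError listing missing or mistyped keys."""
    config = json.loads(text)
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"wrong type for {key}")
    if problems:
        raise ValueError("; ".join(problems))
    return config
```

Running `check_config` on the file's contents before restarting the service is a cheap way to catch a typo in a hand-edited setting.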


Architecture

| Layer | Component |
|---|---|
| HTTP | FastAPI + Uvicorn, single worker |
| LLM inference | openvino_genai.LLMPipeline, executor-offloaded |
| VLM inference | openvino_genai.VLMPipeline |
| Embeddings | OVModelForFeatureExtraction (optimum-intel) |
| Task routing | Embedding similarity + signal detection |
| STT | openvino_genai.WhisperPipeline |
| TTS | Kokoro-ONNX (EN) + Piper (PL) |
| Observability | PostgreSQL 16 + pgvector |
| Monitor UI | Svelte + uPlot |
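
"Executor-offloaded" presumably means that the blocking generate call runs on a worker thread so the single async worker can keep serving other requests. The sketch below shows that general pattern with asyncio. It is not Stormvino's code: `blocking_generate` is a stand-in that sleeps instead of calling a real pipeline, and the single-thread executor is an assumption.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: one worker thread, so generations run one at a time.
_executor = ThreadPoolExecutor(max_workers=1)


def blocking_generate(prompt: str) -> str:
    """Stand-in for a blocking pipeline call; sleeps instead of running a model."""
    time.sleep(0.05)
    return f"echo: {prompt}"


async def generate(prompt: str) -> str:
    """Run the blocking call on the executor without stalling the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, blocking_generate, prompt)


async def demo() -> list:
    """Count event-loop ticks that happen while a generation is in flight."""
    ticks = 0
    task = asyncio.create_task(generate("hi"))
    while not task.done():
        ticks += 1
        await asyncio.sleep(0.01)
    return [task.result(), ticks]
```

`asyncio.run(demo())` returns the echoed text together with a tick count of at least one, showing that the loop stayed responsive while the worker thread was busy.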

Hardware reports welcome

Tested Stormvino on a GPU not in the compatibility table? Open a hardware report issue with the GPU model, VRAM, kernel version, and tokens/sec. Each report extends the matrix for everyone.
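
The tokens/sec figure for a report can be computed from the number of generated tokens and the wall-clock time of one request. The helpers below are a hypothetical sketch. They assume you read the token count from the response's `usage.completion_tokens` field (as in the OpenAI API) and time the request yourself. The report layout is illustrative, not a required template.

```python
def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generation throughput for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def format_report(gpu: str, vram_gb: float, kernel: str,
                  completion_tokens: int, elapsed_seconds: float) -> str:
    """One-line summary covering the four fields a hardware report asks for."""
    rate = tokens_per_second(completion_tokens, elapsed_seconds)
    return f"GPU: {gpu} | VRAM: {vram_gb:g} GB | kernel: {kernel} | {rate:.1f} tok/s"
```

The kernel string can be taken from `uname -r`. Noting which model was used alongside the number makes reports easier to compare.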


Origin

Stormvino grew out of Shangri-Lab, a personal lab built by an IT architect from Silesia who had no Python background, a pair of Intel Arc GPUs, and a firm belief that local inference shouldn't require NVIDIA hardware or magic frameworks.

The philosophy is unchanged: build the simplest thing that gives full visibility first, tune quality only after you can observe it.

Built with Claude Code.

About

OpenAI API server for OpenVINO - when OVMS is too big
