Krasis

Krasis is an LLM runtime for running large MoE models on NVIDIA consumer GPUs. It is built around fast GPU prompt processing, GPU-executed decode, and HCS expert residency management so models much larger than VRAM can still run locally.

The current runtime is no longer the early Python-hot-path prototype. The serving path is Rust/CUDA focused: Python is used for launcher/setup/model loading work, while the performance-sensitive runtime path uses Rust/CUDA orchestration, CUDA kernels, cached quantized weights, and measured VRAM budgeting.

You can contact me here, but for bugs, setup problems, model requests, or feature requests please open a GitHub issue.

If you want to monitor Krasis during runs, check out ktop.

What Krasis Does

Runs multi-hundred-billion-parameter MoE models from BF16 safetensors on commodity NVIDIA GPU systems.
Uses full GPU prefill for fast prompt processing.
Uses GPU-executed decode with HCS managing hot/cold expert residency between VRAM and CPU RAM.
Builds cached INT4/INT8 expert formats and HQQ attention caches under ~/.krasis.
Supports compact KV cache modes including k6v6 Quality and k4v4 Ultra Compact.
Provides an interactive launcher, OpenAI-compatible API, chat client, reproducible benchmarks, and GitHub-release based installation.

Major Changes Since The Previous Stable Release

The current release line is a major change from v0.1.64, the previous stable Krasis release. Highlights:

Runtime hot-path work moved out of Python and into Rust/CUDA for serving, decode orchestration, timing, HCS operations, and benchmark-critical paths.
Added HQQ attention support, including HQQ4, HQQ6, HQQ8, auto mixed profiles, cache build/rebuild support, and HQQ benchmark/validation lanes.
Added compact KV cache formats, including k6v6 and k4v4, with k6v6 as the quality-oriented launcher default and k4v4 for tighter VRAM budgets.
Added full Ampere support for the current production path. HQQ attention and compact KV cache modes were built with Ampere compatibility in mind and do not require FP8-capable hardware.
Added Qwen3.6-35B-A3B support, including HQQ4/k4v4 RTX 5090/5080 benchmark coverage and HQQ6/k6v6 RTX A4500 Ampere coverage.
Added and hardened HCS expert residency management: measured startup calibration, prompt-conditioned reload, dynamic recency tail, per-stage budgets, soft-tier reload caps, and safe eviction/reload paths.
Added runtime VRAM safety systems: short/long prefill/decode calibration, measured scratch budgets, pressure detection, idle pressure drain, and hard exit protection before CUDA enters an unsafe OOM state.
Added full GitHub release wheel packaging for Python 3.10, 3.11, 3.12, and 3.13, with vendored CUDA sidecars injected into wheels.
Added krasis update and krasis prerelease maintenance commands.
Added an interactive Hugging Face downloader/search flow in the launcher.
Added reverse SSH tunnel support for exposing a local Krasis server to a remote machine through SSH without opening public ports.
Added repeatable benchmark and release-test commands, benchmark log archival, llama-witness based correctness validation, and richer diagnostics.
Removed Session messenger integration and other stale prototype-era surfaces.
Cleaned terminal/log output so human console lines are clean while prefixed records go to log files.
Deprecated AWQ and Polar4 for new production runs. Current production surfaces use HQQ attention plus k6v6, k4v4, or BF16 KV depending on the memory/quality target.

Benchmarks

Selected current timing-disabled results. Decode is the internal engine measurement; HTTP round trip includes local client/server HTTP overhead.

Hardware	Model	Params	Attention	KV	Prefill	Decode	HTTP round trip
RTX 5090 32 GB	Qwen3.6-35B-A3B	35B	HQQ4	k4v4	10030.3 tok/s	124.88 tok/s	267.00 tok/s
RTX 5090 32 GB	Qwen3-Coder-Next	80B	HQQ8	k4v4	6111.2 tok/s	88.59 tok/s	157.00 tok/s
RTX 5090 32 GB	Qwen3.5-122B-A10B	122B	HQQ6	k4v4	4880.4 tok/s	25.29 tok/s	44.95 tok/s
RTX 5090 32 GB	Qwen3-235B-A22B	235B	HQQ6	k4v4	1459.1 tok/s	3.54 tok/s	6.17 tok/s
RTX A4500 20 GB	Qwen3.6-35B-A3B	35B	HQQ6	k6v6	2235.2 tok/s	50.98 tok/s	103.98 tok/s
RTX A4500 20 GB	Qwen3-Coder-Next	80B	HQQ6	k4v4	1569.5 tok/s	34.69 tok/s	60.47 tok/s
RTX 5080 16 GB	Qwen3.6-35B-A3B	35B	HQQ4	k4v4	3743.5 tok/s	60.04 tok/s	128.55 tok/s
RTX 3070 Laptop 8 GB	Qwen3.5-35B-A3B	35B	HQQ4	k4v4	222.1 tok/s	12.48 tok/s	22.00 tok/s

Tradeoffs And Requirements

Krasis currently targets NVIDIA GPUs with CUDA, including Ampere and newer architectures. The production HQQ attention and compact KV cache modes do not require FP8 support.
Input models should be BF16 safetensors from Hugging Face or another local safetensors source.
First run is slower because Krasis builds optimized local caches. Later runs reuse those caches.
Disk usage must cover the source model plus Krasis cache artifacts under ~/.krasis.
System RAM should be sized for the selected quantized cache and HCS backing store. Larger models need substantial RAM even when GPU VRAM is limited.
Production runs should use quantized INT4/INT8 expert caches and HQQ attention. BF16-heavy modes are validation/debug modes, not normal deployment targets.

Quick Start

Requirements

Linux, including Ubuntu 24.04+ or WSL2 on Windows
Python 3.10+
NVIDIA GPU with CUDA drivers installed
Rust is only needed for source builds, not normal wheel installs
Enough disk/RAM for the source model and generated Krasis caches

1. Install Krasis

curl -sSf https://raw.githubusercontent.com/brontoguana/krasis/main/install.sh | bash

This creates a managed environment at ~/.krasis/venv, installs Krasis, symlinks commands into ~/.local/bin, and updates PATH for the current shell. No sudo is required for the Krasis install itself.

2. Install CUDA Dependencies

krasis-setup

This installs runtime CUDA/PyTorch dependencies when needed. It is usually only required once per machine.

3. Download A Model

Run:

krasis

Then use the interactive launcher to search/download supported Hugging Face models, or put BF16 safetensors manually under ~/.krasis/models/.

Manual download example:

huggingface-cli download Qwen/Qwen3-Coder-Next \
    --local-dir ~/.krasis/models/Qwen3-Coder-Next

4. Run

krasis

The launcher walks through model selection, GPU selection, quantization/runtime options, and server startup. Settings are saved under ~/.krasis/config.

Updating

# Latest stable release
krasis update

# Latest pre-release
krasis prerelease

# Uninstall Krasis, keeping model files
curl -sSf https://raw.githubusercontent.com/brontoguana/krasis/main/install.sh | bash -s -- --uninstall

WSL2

Krasis works on WSL2. By default WSL often limits available memory, which is usually too small for large MoE models. Create or edit:

C:\Users\<YourUsername>\.wslconfig

Example:

[wsl2]
memory=120GB

Adjust the value to leave memory for Windows, then restart WSL from PowerShell:

wsl --shutdown

Usage

Interactive Launcher

krasis

The launcher provides:

model selection from local models
Hugging Face model search and download
GPU selection, including selected GPU indices
quantization, HQQ attention, KV cache, HCS, and VRAM safety settings
optional reverse SSH tunnel target
benchmark/run choices

Non-Interactive Launch

# Use saved config
krasis --non-interactive

# Use a config file
krasis --config tests/qcn-k4v4-hqq8-int4-benchmark.conf

# Override selected values
krasis --non-interactive --model-path /path/to/model --selected-gpus 0,2 --benchmark

Common options:

--attention-quant hqq6 or hqq8
--kv-dtype k6v6, k4v4, or bf16
--gpu-expert-bits 4 or 8
--vram-safety-margin 600
--dynamic-hcs / --no-dynamic-hcs
--ssh-tunnel user@host

For the full option surface, run:

krasis --help

Chat Client

krasis chat
krasis chat --prompt "Explain HCS in one paragraph"
krasis chat --file prompts.txt
krasis chat --port 8013
krasis chat --url http://host:8012

The standalone command also remains available:

krasis-chat

API

Krasis exposes an OpenAI-compatible chat endpoint:

http://localhost:8012/v1/chat/completions

Useful endpoints:

GET /health
GET /v1/models
POST /v1/timing

Benchmarks

Use the fixed speed-regression entry point for repeatable Qwen3-Coder-Next speed checks:

./dev speed-test

Run a standard benchmark for a config:

./dev benchmark tests/qcn-k4v4-hqq8-int4-benchmark.conf

Run a benchmark from the installed command:

krasis --config tests/qcn-k4v4-hqq8-int4-benchmark.conf --benchmark

Source Build

For development builds:

git clone https://github.com/brontoguana/krasis.git
cd krasis
./dev build
./dev run qcn

The ./dev entry point handles environment setup and is preferred for local development commands.

Advanced Documentation

See ADVANCED.md for detailed config options, quantization modes, HQQ cache controls, HCS controls, benchmarking commands, and API details.

License

SSPL-1.0

Krasis is free to use, modify, and distribute.

If you want to support the project or offer Krasis as part of a commercial product or a hosted/managed service, please get in touch.

Name		Name	Last commit message	Last commit date
Latest commit History 598 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
dist_check		dist_check
logs		logs
perplexity		perplexity
podman		podman
python/krasis		python/krasis
release_inspect		release_inspect
scripts		scripts
src		src
target_prerelease_check		target_prerelease_check
target_py311		target_py311
templates/attention		templates/attention
testconfigs		testconfigs
tests		tests
tools		tools
.gitignore		.gitignore
ADVANCED.md		ADVANCED.md
CHANGELOG.md		CHANGELOG.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
DEV.md		DEV.md
LICENSE		LICENSE
README.md		README.md
TESTING.md		TESTING.md
build.rs		build.rs
dev		dev
dump_trace.sh		dump_trace.sh
fix-oomd.sh		fix-oomd.sh
gpu_cleanup.sh		gpu_cleanup.sh
gpu_reset.sh		gpu_reset.sh
install.sh		install.sh
krasis		krasis
krasis-chat		krasis-chat
krasis_server.png		krasis_server.png
krasis_server_2.png		krasis_server_2.png
krasis_server_3.png		krasis_server_3.png
perf_diff_20260330_b8c071d_vs_worktree.patch		perf_diff_20260330_b8c071d_vs_worktree.patch
perf_diff_20260330_b8c071d_vs_worktree_review.txt		perf_diff_20260330_b8c071d_vs_worktree_review.txt
pyproject.toml		pyproject.toml
run_benchmark.sh		run_benchmark.sh
setup_pcie.sh		setup_pcie.sh
sidecar_abi_version.txt		sidecar_abi_version.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Krasis

What Krasis Does

Major Changes Since The Previous Stable Release

Benchmarks

Tradeoffs And Requirements

Quick Start

Requirements

1. Install Krasis

2. Install CUDA Dependencies

3. Download A Model

4. Run

Updating

WSL2

Usage

Interactive Launcher

Non-Interactive Launch

Chat Client

API

Benchmarks

Source Build

Advanced Documentation

License

About

Uh oh!

Releases 131

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Krasis

What Krasis Does

Major Changes Since The Previous Stable Release

Benchmarks

Tradeoffs And Requirements

Quick Start

Requirements

1. Install Krasis

2. Install CUDA Dependencies

3. Download A Model

4. Run

Updating

WSL2

Usage

Interactive Launcher

Non-Interactive Launch

Chat Client

API

Benchmarks

Source Build

Advanced Documentation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 131

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages