GitHub - mithraeums/hako: Open source hako LLMs.

hako: a from-scratch C inference engine for the hako model family. The engine binary is hakm.

site · hako-code (agent) · hako-edit (editor) · org

This repo is the engine
Own GGUF -> MLF2 loader, own Q4_K/Q6_K + int8 kernels, own BPE tokenizer, own Qwen2 forward pass. No ggml, no llama.cpp, no torch, no ollama at runtime. libc + libm + pthread, that's the stack.

It runs the hako models end to end and is spawned by hako-code (the agent) and hako-edit (the editor) as a one-shot hakm --chat-stdin subprocess.
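
As a rough illustration of the one-shot subprocess pattern, the sketch below forks a child, writes a prompt to its stdin, closes the pipe to signal end of input, and reads stdout until EOF. `run_oneshot` is a hypothetical helper, not code taken from hako-code or hako-edit, and the payload hakm expects on stdin under --chat-stdin is not described here, so it is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with argv, feed `input` on its stdin and
 * capture its stdout into `out` (NUL-terminated, truncated to outcap - 1).
 * Returns the number of bytes captured, or -1 on error or non-zero exit.
 * Sketch only: it writes all input before reading any output, so it suits
 * small prompts, and it does not guard against SIGPIPE if the child dies early. */
static long run_oneshot(char *const argv[], const char *input,
                        char *out, size_t outcap)
{
    assert(outcap > 0);
    int to_child[2], from_child[2];
    if (pipe(to_child) != 0) return -1;
    if (pipe(from_child) != 0) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        return -1;
    }
    if (pid == 0) {                        /* child: wire pipes to stdin/stdout */
        dup2(to_child[0], STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        execvp(argv[0], argv);
        _exit(127);                        /* exec failed */
    }

    close(to_child[0]);                    /* parent keeps only its own ends */
    close(from_child[1]);

    size_t len = strlen(input), off = 0;
    while (off < len) {
        ssize_t w = write(to_child[1], input + off, len - off);
        if (w <= 0) break;
        off += (size_t)w;
    }
    close(to_child[1]);                    /* EOF: the input is complete */

    size_t got = 0;
    while (got < outcap - 1) {
        ssize_t r = read(from_child[0], out + got, outcap - 1 - got);
        if (r <= 0) break;
        got += (size_t)r;
    }
    out[got] = '\0';
    close(from_child[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;
    return (long)got;
}
```

A caller would pass an argv along the lines of `{"./hakm", model_path, "--chat-stdin", NULL}`; that argument order is an assumption based on the CLI examples in this README.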

Models

Weights aren't stored here; the engine pulls them from HuggingFace on request (hako :pull <id>, or download + convert yourself). Hosted under huggingface.co/mithraeum:

| model | tier | base | license |
|---|---|---|---|
| `hako-sho` | mini · 3B | Qwen2.5-Coder-3B-Instruct | Qwen-research (non-commercial) |
| `hako-koi` | mid · 7B | Qwen2.5-Coder-7B-Instruct | Apache-2.0 |

Future tiers (14B/32B fine-tune, 50B+ max) are queued. Stock wraps carry no version; the first real fine-tune of a tier earns v0.0.1.

Build & run

git clone https://github.com/mithraeums/hako && cd hako
make                                            # builds ./hakm — libc + libm + pthread, no deps

# convert a Qwen2.5-Coder GGUF (HF download, or an existing ollama blob) once:
python3 tools/gguf2mlf.py model.gguf ~/.hako/models/hako-sho.mlf2

./hakm ~/.hako/models/hako-sho.mlf2 --raw -t 0 "def fibonacci(n):"      # raw completion
./hakm ~/.hako/models/hako-sho.mlf2 --sys "You are hako." "ring buffers in C"  # one chat turn
./hakm ~/.hako/models/hako-sho.mlf2 --sys "You are hako."               # interactive REPL
# flags: -n new-tokens  -t temp(0=greedy)  -p top_p  -k top_k  -s seed  --raw --info --chat-stdin

Or just let the agent fetch a model: hako → :pull hako-sho.

How it works

Offline conversion - tools/gguf2mlf.py reads a GGUF and emits an MLF2 file: the native container. Quant blocks copied verbatim (ggml's k-quant layout, preserved), tokenizer (vocab + merges) embedded, arch params in a fixed 128-byte header. Pure stdlib; no torch/llama.cpp dependency.
Runtime (C11, libc + libm + pthread):
- loader.c - mmap the MLF2, resolve tensors by name. RSS = activations + KV.
- quant.c - Q4_K / Q6_K dequant + int8 fast-path dot. Bit-exact vs an independent Python reimpl (see tests/).
- nn.c - rmsnorm, softmax, SiLU, NeoX rope, dequant-on-the-fly + int8 matmul.
- model.c - Qwen2 forward: GQA, QKV bias, KV cache, lm_head (tied to token_embd on small tiers, separate output.weight on 7B+).
- bpe.c - Qwen2 byte-level BPE (GPT2 byte map + rank-greedy merges).
- cli/main.c - the hakm CLI (one-shot / chat REPL / --chat-stdin).

Validated

Runs the real Qwen2.5-Coder 3B and 7B weights end to end (greedy + sampling, ChatML, multi-turn REPL):

3B (hako-sho): qwen2 · 36 layers · d_model 2048 · 16/2 heads (GQA) · ffn 11008 · vocab 151936 · tied embeddings.
7B (hako-koi): qwen2 · 28 layers · d_model 3584 · 28/4 heads (GQA) · ffn 18944 · vocab 152064 · untied (output.weight).
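
The head counts above determine both the GQA grouping and the KV cache footprint. The sketch below works that arithmetic out from the listed shapes; `arch_t` and the helper names are hypothetical, head_dim is taken as d_model / n_heads, and the element size of the KV cache is an assumption passed in by the caller, since the storage type is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the shape parameters listed above. */
typedef struct {
    size_t n_layers, d_model, n_heads, n_kv_heads;
} arch_t;

/* Per-head width, assuming d_model is split evenly across query heads. */
static size_t head_dim(const arch_t *a)
{
    assert(a->n_heads > 0 && a->d_model % a->n_heads == 0);
    return a->d_model / a->n_heads;
}

/* GQA: each run of n_heads / n_kv_heads consecutive query heads
 * shares a single key/value head. */
static size_t kv_head_for(const arch_t *a, size_t q_head)
{
    assert(q_head < a->n_heads && a->n_heads % a->n_kv_heads == 0);
    return q_head / (a->n_heads / a->n_kv_heads);
}

/* Bytes of KV cache added per token: one K and one V vector
 * per KV head per layer. elem_bytes is the caller's assumption. */
static size_t kv_bytes_per_token(const arch_t *a, size_t elem_bytes)
{
    return 2 * a->n_layers * a->n_kv_heads * head_dim(a) * elem_bytes;
}
```

With the 3B shape (36 layers, 16/2 heads, head_dim 128) and 4-byte elements, this comes to 73,728 bytes (72 KiB) per token; the 7B shape gives 114,688 bytes (112 KiB). Halve both if the cache is stored in 16-bit floats.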

Quant correctness gate: C output bit-exact vs tests/q_ref.py (maxdiff 0.0).

Speed

2.2–2.5 tok/s wall-clock on a 4-core x86_64 (greedy, 3B, Q4_K_M), up from 1.07: a ~2.1× win from the int8 fast path. The activation is quantized to int8 once (quantize_row_q8_32) and dotted straight against the 4-bit weights (q4k_vec_dot); the weights stay exact and the output is bit-identical to the float path.

There is an AVX2 kernel (_mm256_maddubs_epi16) plus a dormant NEON vdotq_s32 path for arm64. Matmul is multithreaded over output rows (persistent pthread pool) and built with -march=native.
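
The idea behind the int8 fast path can be sketched in scalar C. This is a simplified stand-in, not the repository's kernels: it uses flat 32-element blocks with one float scale and one float minimum per weight block, whereas real Q4_K packs 256-weight super-blocks with 6-bit sub-block scales and mins. The type and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32  /* block size used by this sketch */

typedef struct { float d; int8_t q[QK]; } act_block_t;          /* x ~ d * q */
typedef struct { float d, m; uint8_t q[QK / 2]; } w4_block_t;   /* w = d * nibble - m */

/* Quantize n activations (n a multiple of QK) to int8, one scale per block. */
static void quantize_act(act_block_t *out, const float *x, size_t n)
{
    assert(n % QK == 0);
    for (size_t b = 0; b < n / QK; b++) {
        const float *xb = x + b * QK;
        float amax = 0.0f;
        for (int i = 0; i < QK; i++) {
            float a = xb[i] < 0.0f ? -xb[i] : xb[i];
            if (a > amax) amax = a;
        }
        float d = amax / 127.0f;
        float id = d > 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < QK; i++) {
            float v = xb[i] * id;
            /* round half away from zero */
            out[b].q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f);
        }
    }
}

/* Dot product of 4-bit weights against int8 activations.
 * Byte j holds element j in its low nibble and element j + QK/2 in its high
 * nibble (this sketch's own packing). The inner loop is integer-only; floats
 * are applied once per block:
 *   sum((d_w*qw - m_w) * d_a*qa) = d_a * (d_w * sum(qw*qa) - m_w * sum(qa)) */
static float dot_w4_a8(const w4_block_t *w, const act_block_t *a, size_t nblocks)
{
    float sum = 0.0f;
    for (size_t b = 0; b < nblocks; b++) {
        int32_t qq = 0, qa = 0;
        for (int j = 0; j < QK / 2; j++) {
            int lo = w[b].q[j] & 0x0F, hi = w[b].q[j] >> 4;
            qq += lo * a[b].q[j] + hi * a[b].q[j + QK / 2];
            qa += a[b].q[j] + a[b].q[j + QK / 2];
        }
        sum += a[b].d * (w[b].d * (float)qq - w[b].m * (float)qa);
    }
    return sum;
}
```

The weight nibbles are never dequantized to float in the inner loop, which is what lets a SIMD kernel do the multiply-accumulate on packed integers.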

Measured thread scaling (T=1/2/4 -> 0.73/1.44/2.44 tok/s) shows it's compute-bound near the 4-core ceiling. Further speed needs less work (speculative decode, smaller draft model) or wider SIMD, not more threads. This is correctness-first; the goal is owning the stack, not beating tuned GPU runtimes.
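
The scaling figures can be turned into speedup and per-thread efficiency with two small helpers. This is only arithmetic on the numbers quoted above; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup of a multi-thread rate over the single-thread rate (both in tok/s). */
static double speedup(double rate_1, double rate_t)
{
    assert(rate_1 > 0.0);
    return rate_t / rate_1;
}

/* Parallel efficiency: speedup divided by thread count (1.0 = perfect scaling). */
static double efficiency(double rate_1, double rate_t, int threads)
{
    assert(threads > 0);
    return speedup(rate_1, rate_t) / (double)threads;
}
```

Plugging in 0.73, 1.44 and 2.44 tok/s gives a speedup of about 1.97× at two threads and about 3.34× at four, which is an efficiency of roughly 0.84 at four threads.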

Known gaps / next

Tokenizer pretokenizer is simplified (whitespace-delimited; merges, symbols, special tokens correct). Needs the full Qwen2 regex for digit/punctuation-run boundaries to exactly match reference tokenization.
Windows: the loader uses POSIX mmap; MinGW lacks sys/mman.h. An #ifdef _WIN32 mmap shim is the unlock. Linux/macOS/FreeBSD build today.
Next SIMD: NEON for arm64, tighter AVX2, then speculative decode for the real ceiling break.
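
One possible shape for the Windows mmap shim is sketched below: a read-only file mapping behind a single function, with the Win32 file-mapping API on one side of the #ifdef and POSIX mmap on the other. `mapped_file_t`, `map_file` and `unmap_file` are hypothetical names, not the loader's actual interface, and only the POSIX branch has been exercised.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/* Hypothetical handle for a read-only mapping. */
typedef struct { const void *data; size_t size; } mapped_file_t;

/* Map a whole file read-only. Returns 0 on success, -1 on failure. */
static int map_file(const char *path, mapped_file_t *mf)
{
    assert(path && mf);
    mf->data = NULL;
    mf->size = 0;
#ifdef _WIN32
    HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return -1;
    LARGE_INTEGER sz;
    if (!GetFileSizeEx(f, &sz) || sz.QuadPart == 0) { CloseHandle(f); return -1; }
    HANDLE m = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(f);              /* the mapping object keeps the file open */
    if (!m) return -1;
    void *p = MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(m);              /* the view keeps the mapping object alive */
    if (!p) return -1;
    mf->data = p;
    mf->size = (size_t)sz.QuadPart;
#else
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return -1; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    if (p == MAP_FAILED) return -1;
    mf->data = p;
    mf->size = (size_t)st.st_size;
#endif
    return 0;
}

/* Release a mapping created by map_file. Safe to call on an empty handle. */
static void unmap_file(mapped_file_t *mf)
{
    if (!mf->data) return;
#ifdef _WIN32
    UnmapViewOfFile(mf->data);
#else
    munmap((void *)mf->data, mf->size);
#endif
    mf->data = NULL;
    mf->size = 0;
}
```

Keeping the platform split inside these two functions would leave the rest of the loader unchanged, since it only needs a base pointer and a length.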

License

Engine code - GPL-3.0 (LICENSE). From-scratch C; no ggml/llama.cpp/torch.
Model weights keep their required upstream licenses (on HuggingFace, not here): hako-koi (7B) Apache-2.0; hako-sho (3B) Qwen RESEARCH — non-commercial only (the 3B base is research-licensed). See each model repo's LICENSE/NOTICE.
