This repo is the engine.
Own GGUF -> MLF2 loader, own Q4_K/Q6_K + int8 kernels,
own BPE tokenizer, own Qwen2 forward pass. No ggml, no llama.cpp, no torch, no
ollama at runtime. libc + libm + pthread, that's the stack.
It runs the hako models end to end and is spawned by hako-code (the agent) and
hako-edit (the editor) as a one-shot hakm --chat-stdin subprocess.
Weights aren't stored here; the engine pulls them from HuggingFace on request
(hako :pull <id>, or download + convert yourself). Hosted under
huggingface.co/mithraeum:
| model | tier | base | license |
|---|---|---|---|
| hako-sho | mini · 3B | Qwen2.5-Coder-3B-Instruct | [ ! ] Qwen-research (non-commercial) |
| hako-koi | mid · 7B | Qwen2.5-Coder-7B-Instruct | Apache-2.0 |
Future tiers (14B/32B fine-tune, 50B+ max) are queued. Stock wraps carry no
version; the first real fine-tune of a tier earns v0.0.1.
git clone https://github.com/mithraeums/hako && cd hako
make # builds ./hakm — libc + libm + pthread, no deps
# convert a Qwen2.5-Coder GGUF (HF download, or an existing ollama blob) once:
python3 tools/gguf2mlf.py model.gguf ~/.hako/models/hako-sho.mlf2
./hakm ~/.hako/models/hako-sho.mlf2 --raw -t 0 "def fibonacci(n):" # raw completion
./hakm ~/.hako/models/hako-sho.mlf2 --sys "You are hako." "ring buffers in C" # one chat turn
./hakm ~/.hako/models/hako-sho.mlf2 --sys "You are hako." # interactive REPL
# flags: -n new-tokens -t temp(0=greedy) -p top_p -k top_k -s seed --raw --info --chat-stdin

Or just let the agent fetch a model: hako → :pull hako-sho.
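For scripting, the invocations above can be assembled programmatically. The following is a minimal sketch, assuming flags go between the model path and the prompt as in the examples; `hakm_argv` and `run_hakm` are hypothetical helpers, not part of this repo, and only flags listed above are used.

```python
import os
import shlex
import subprocess


def hakm_argv(model, prompt=None, *, system=None, raw=False, new_tokens=None,
              temp=None, top_p=None, top_k=None, seed=None, binary="./hakm"):
    """Build an argv list for hakm (hypothetical helper, flag order assumed)."""
    argv = [binary, os.path.expanduser(model)]
    if raw:
        argv.append("--raw")
    if system is not None:
        argv += ["--sys", system]
    # Numeric flags as listed in the flags comment: -n -t -p -k -s
    for flag, value in (("-n", new_tokens), ("-t", temp), ("-p", top_p),
                        ("-k", top_k), ("-s", seed)):
        if value is not None:
            argv += [flag, str(value)]
    if prompt is not None:
        argv.append(prompt)
    return argv


def run_hakm(argv):
    """Run hakm once and return its stdout; requires a built ./hakm."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


# Mirrors the raw-completion example above; prints the shell-quoted command.
print(shlex.join(hakm_argv("~/.hako/models/hako-sho.mlf2",
                           "def fibonacci(n):", raw=True, temp=0)))
```

Passing an argv list instead of a shell string avoids quoting problems when the prompt contains spaces or quotes.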
- Offline conversion - tools/gguf2mlf.py reads a GGUF and emits an MLF2 file: the native container. Quant blocks copied verbatim (ggml's k-quant layout, preserved), tokenizer (vocab + merges) embedded, arch params in a fixed 128-byte header. Pure stdlib; no torch/llama.cpp dependency.
- Runtime (C11, libc + libm + pthread):
  - loader.c - mmap the MLF2, resolve tensors by name. RSS = activations + KV.
  - quant.c - Q4_K / Q6_K dequant + int8 fast-path dot. Bit-exact vs an independent Python reimpl (see tests/).
  - nn.c - rmsnorm, softmax, SiLU, NeoX rope, dequant-on-the-fly + int8 matmul.
  - model.c - Qwen2 forward: GQA, QKV bias, KV cache, lm_head (tied to token_embd on small tiers, separate output.weight on 7B+).
  - bpe.c - Qwen2 byte-level BPE (GPT2 byte map + rank-greedy merges).
  - cli/main.c - the hakm CLI (one-shot / chat REPL / --chat-stdin).
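The primitives listed for nn.c are small enough to sketch in plain Python. This is an illustrative reference, not the repo's C code: the epsilon and rope base (`eps=1e-6`, `theta=1_000_000.0`) are assumptions, and the function names are hypothetical. The NeoX rope variant pairs element `i` with element `i + d/2`, rather than rotating adjacent pairs.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rope_neox(vec, pos, theta=1_000_000.0):
    """NeoX-style rotary embedding for one head vector at position pos."""
    d = len(vec)
    half = d // 2
    out = list(vec)
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]      # the NeoX pairing: (i, i + d/2)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Two cheap sanity checks for any rope implementation: position 0 must leave the vector unchanged, and every position must preserve the vector's norm, since each pair is only rotated.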
Runs the real Qwen2.5-Coder 3B and 7B weights end to end (greedy + sampling, ChatML, multi-turn REPL):
- 3B (hako-sho): qwen2 · 36 layers · d_model 2048 · 16/2 heads (GQA) · ffn 11008 · vocab 151936 · tied embeddings.
- 7B (hako-koi): qwen2 · 28 layers · d_model 3584 · 28/4 heads (GQA) · ffn 18944 · vocab 152064 · untied (output.weight).
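The GQA figures above determine how query heads share KV heads and how fast the KV cache grows. The sketch below works that arithmetic out under three assumptions that are not stated in this README: head_dim = d_model / n_heads, query heads are grouped contiguously, and the cache stores 4-byte floats. `Arch` is a hypothetical helper type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arch:
    name: str
    n_layers: int
    d_model: int
    n_heads: int
    n_kv_heads: int

    @property
    def head_dim(self):
        return self.d_model // self.n_heads        # assumed: d_model / n_heads

    @property
    def group_size(self):
        return self.n_heads // self.n_kv_heads     # query heads per KV head

    def kv_head_for(self, q_head):
        return q_head // self.group_size           # assumed: contiguous grouping

    def kv_bytes_per_token(self, bytes_per_elem=4):
        # K and V, for every layer and every KV head (element size is an assumption)
        return 2 * self.n_layers * self.n_kv_heads * self.head_dim * bytes_per_elem


SHO = Arch("hako-sho", n_layers=36, d_model=2048, n_heads=16, n_kv_heads=2)
KOI = Arch("hako-koi", n_layers=28, d_model=3584, n_heads=28, n_kv_heads=4)

for a in (SHO, KOI):
    print(f"{a.name}: head_dim={a.head_dim} group={a.group_size} "
          f"kv/token={a.kv_bytes_per_token() // 1024} KiB")
```

Under these assumptions both tiers use a head dimension of 128; the 3B tier shares each KV head across 8 query heads and needs 72 KiB of cache per token, and the 7B tier shares across 7 and needs 112 KiB.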
Quant correctness gate: C output bit-exact vs tests/q_ref.py (maxdiff 0.0).
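For orientation, here is a sketch of what a Q4_K reference dequantizer can look like. It follows ggml's Q4_K super-block layout as commonly documented (256 weights in 144 bytes: fp16 `d`, fp16 `dmin`, 12 bytes of packed 6-bit scales/mins, 128 bytes of 4-bit quants). It is not tests/q_ref.py, the function names are hypothetical, and it computes in Python floats (float64), so it illustrates the layout rather than the bit-exact float32 arithmetic the gate checks.

```python
import struct

QK_K = 256          # weights per super-block
BLOCK_BYTES = 144   # 2 (d) + 2 (dmin) + 12 (scales/mins) + 128 (4-bit quants)


def scale_min(j, sc):
    """Unpack the 6-bit (scale, min) pair of sub-block j from 12 packed bytes."""
    if j < 4:
        return sc[j] & 63, sc[j + 4] & 63
    scale = (sc[j + 4] & 0x0F) | ((sc[j - 4] >> 6) << 4)
    minv = (sc[j + 4] >> 4) | ((sc[j] >> 6) << 4)
    return scale, minv


def dequant_q4k_block(block):
    """Dequantize one 144-byte Q4_K super-block into 256 floats."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("Q4_K super-block must be 144 bytes")
    d, dmin = struct.unpack_from("<ee", block, 0)   # two little-endian fp16 values
    sc = block[4:16]
    qs = block[16:144]
    out = []
    for chunk in range(QK_K // 64):
        q = qs[chunk * 32:(chunk + 1) * 32]
        s1, m1 = scale_min(2 * chunk, sc)
        s2, m2 = scale_min(2 * chunk + 1, sc)
        # Low nibbles form one 32-weight sub-block, high nibbles the next one.
        out += [d * s1 * (b & 0x0F) - dmin * m1 for b in q]
        out += [d * s2 * (b >> 4) - dmin * m2 for b in q]
    return out


# Synthetic block: d=0.5, dmin=0.25, every sub-block has scale 2 and min 4,
# every quant byte is 0x21 (low nibble 1, high nibble 2).
demo = (struct.pack("<ee", 0.5, 0.25)
        + bytes([2, 2, 2, 2, 4, 4, 4, 4, 0x42, 0x42, 0x42, 0x42])
        + bytes([0x21] * 128))
values = dequant_q4k_block(demo)
```

With the synthetic block, each weight is `0.5 * 2 * q - 0.25 * 4 = q - 1`, so low-nibble sub-blocks decode to 0.0 and high-nibble sub-blocks to 1.0. A bit-exact comparison against C would additionally need every intermediate rounded to float32.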
2.2–2.5 tok/s wall on a 4-core x86_64 (greedy, 3B, Q4_K_M), up from 1.07: a
~2.1× win from the int8 fast path. It quantizes the activation to int8 once
(quantize_row_q8_32) and dots it straight against the 4-bit weights
(q4k_vec_dot); weights stay exact, output bit-identical to the float path. AVX2
kernel (_mm256_maddubs_epi16) + a dormant NEON vdotq_s32 path for arm64. Matmul
is multithreaded over output rows (persistent pthread pool), built -march=native.
Measured thread scaling (T=1/2/4 -> 0.73/1.44/2.44 tok/s) shows it's compute-bound near the 4-core ceiling. Further speed needs less work (speculative decode, smaller draft model) or wider SIMD, not more threads. This is correctness-first; the goal is owning the stack, not beating tuned GPU runtimes.
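The shape of an int8 fast path (quantize the activation row once, accumulate in integers, apply scales at the end) can be sketched as follows. This is a simplified illustration, not the repo's quantize_row_q8_32 or q4k_vec_dot: the function names, the per-block `amax / 127` scaling, and the flat per-block weight scale/min are all assumptions. It makes no attempt to reproduce the bit-identical result stated above; it only shows the integer-accumulate-then-scale structure.

```python
def quantize_row_q8(x, block=32):
    """Quantize a float row to int8 codes, one float scale per block (assumed scheme)."""
    blocks = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        codes = [int(round(v / scale)) for v in chunk]   # each code lies in [-127, 127]
        blocks.append((scale, codes))
    return blocks


def dot_q4_q8(w_scale, w_min, w_codes, act_blocks, block=32):
    """Dot product of 4-bit weights (w = scale * code - min) with int8 activations."""
    total = 0.0
    for b, (a_scale, a_codes) in enumerate(act_blocks):
        wc = w_codes[b * block:(b + 1) * block]
        isum = sum(w * a for w, a in zip(wc, a_codes))   # integer multiply-accumulate
        asum = sum(a_codes)                              # needed for the min term
        # sum((ws*c - wm) * (as*a)) == as * (ws * sum(c*a) - wm * sum(a))
        total += a_scale * (w_scale[b] * isum - w_min[b] * asum)
    return total
```

The inner loop touches only small integers, which is what maps onto SIMD instructions such as `_mm256_maddubs_epi16`; the floating-point work shrinks to a few multiplies per block.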
- Tokenizer pretokenizer is simplified (whitespace-delimited; merges, symbols, special tokens correct). Needs the full Qwen2 regex for digit/punctuation-run boundaries to exactly match reference tokenization.
- Windows: the loader uses POSIX mmap; MinGW lacks sys/mman.h. An #ifdef _WIN32 mmap shim is the unlock. Linux/macOS/FreeBSD build today.
- Next SIMD: NEON for arm64, tighter AVX2, then speculative decode for the real ceiling break.
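On the tokenizer item: the part described as already correct, the GPT2 byte map plus rank-greedy merges, can be sketched in a few lines. This is an illustrative Python version with hypothetical names and a toy merge table, not bpe.c and not the real Qwen2 merges; the pretokenizer, which is the known gap, is left out entirely.

```python
def gpt2_byte_map():
    """GPT-2's reversible table from byte value to a printable unicode character."""
    keep = (list(range(0x21, 0x7F))      # '!' .. '~'
            + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
            + list(range(0xAE, 0x100)))  # U+00AE .. U+00FF
    mapping, extra = {}, 0
    for b in range(256):
        if b in keep:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(256 + extra)   # shift unprintable bytes past U+00FF
            extra += 1
    return mapping


def bpe_merge(symbols, ranks):
    """Merge the adjacent pair with the lowest rank, one at a time, until none applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        candidates = [(ranks.get(pair, float("inf")), i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(candidates)           # ties resolve to the leftmost pair
        if rank == float("inf"):
            break                           # no known merge is left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


byte_map = gpt2_byte_map()
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}   # toy table for illustration
pieces = bpe_merge([byte_map[b] for b in "lower".encode("utf-8")], toy_ranks)
```

With the toy table, `lower` merges to `low` + `er`. The byte map is why a leading space shows up as `Ġ` in vocab files: byte 0x20 is not in the printable set and is shifted to U+0120.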
- Engine code - GPL-3.0 (LICENSE). From-scratch C; no ggml/llama.cpp/torch.
- Model weights keep their required upstream licenses (on HuggingFace, not here): hako-koi (7B) Apache-2.0; hako-sho (3B) Qwen RESEARCH, non-commercial only (the 3B base is research-licensed). See each model repo's LICENSE/NOTICE.