
tiny-llm - LLM Serving in a Week


Still WIP and in a very early stage. A tutorial on LLM serving using MLX for systems engineers. The codebase is based (almost!) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.

The goal is to learn the techniques behind efficiently serving a large language model (i.e., Qwen2 models).
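To give a flavor of what "array APIs only" means in practice, here is a minimal sketch (not the repository's code) of scaled dot-product attention written purely with `mlx.core` operations; the function name, shapes, and mask convention are illustrative assumptions:

```python
import math
import mlx.core as mx

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, head_dim); mask is additive (e.g. -inf on masked positions).
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = mx.matmul(q, mx.swapaxes(k, -1, -2)) * scale
    if mask is not None:
        scores = scores + mask
    return mx.matmul(mx.softmax(scores, axis=-1), v)

# Tiny smoke test with random tensors.
q = mx.random.normal((1, 4, 8))
k = mx.random.normal((1, 4, 8))
v = mx.random.normal((1, 4, 8))
print(scaled_dot_product_attention(q, k, v).shape)  # (1, 4, 8)
```

Everything in the course is built from primitives like these (matmul, softmax, reshapes), rather than from a framework's attention or transformer modules.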

Why MLX: nowadays it's easier to get a macOS-based local development environment than to set up an NVIDIA GPU environment.

Why Qwen2: it was the first LLM I interacted with -- it's the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built some knowledge around it.

Book

The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.

Community

You may join skyzh's Discord server and study with the tiny-llm community.


Roadmap

| Week + Chapter | Topic | Code | Test | Doc |
|---|---|---|---|---|
| 1.1 | Attention | ✅ | ✅ | ✅ |
| 1.2 | RoPE | ✅ | ✅ | ✅ |
| 1.3 | Grouped Query Attention | ✅ | ✅ | ✅ |
| 1.4 | RMSNorm and MLP | ✅ | ✅ | ✅ |
| 1.5 | Transformer Block | ✅ | 🚧 | 🚧 |
| 1.6 | Load the Model | ✅ | 🚧 | 🚧 |
| 1.7 | Generate Responses (aka Decoding) | ✅ | ✅ | 🚧 |
| 2.1 | Key-Value Cache | ✅ | 🚧 | 🚧 |
| 2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
| 2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
| 2.4 | Flash Attention 2 - CPU | ✅ | 🚧 | 🚧 |
| 2.5 | Flash Attention 2 - GPU | ✅ | 🚧 | 🚧 |
| 2.6 | Continuous Batching | ✅ | 🚧 | 🚧 |
| 2.7 | Chunked Prefill | ✅ | 🚧 | 🚧 |
| 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
| 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
| 3.3 | MoE (Mixture of Experts) | 🚧 | 🚧 | 🚧 |
| 3.4 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
| 3.5 | Prefill-Decode Separation (requires two Macintosh devices) | 🚧 | 🚧 | 🚧 |
| 3.6 | Parallelism | 🚧 | 🚧 | 🚧 |
| 3.7 | AI Agent / Tool Calling | 🚧 | 🚧 | 🚧 |

Other topics not covered: quantized/compressed KV cache and prefix/prompt cache; sampling and fine-tuning; smaller kernels (softmax, SiLU, etc.).
