llm-tuning

Serving and fine-tuning code for the Google Gemma 4 and Qwen3.6 model families, built around solo and concurrent agentic-coding workloads. Everything here runs on Modal out of the box, but the serving layer is a thin wrapper over SGLang and vLLM, so the same commands run on any GPU host; see docs/deploy-byo-cloud.md.

Nothing here is hosted for you. You deploy it to your own account, and the endpoint URL is yours. There are no API keys, hostnames, or accounts baked into the code.

What's in here

The repo is three self-contained projects. Each has its own _common/ library, its own pyproject.toml, and its own deployments. Work from inside the one you need.

Project	What it does	Engine
`gemma4/`	Serve the Gemma 4 family (E2B, E4B, 12B, 26B-A4B, 31B), solo and concurrent, plus a Granite embedding sidecar	SGLang
`qwen/`	Serve Qwen3.6-27B and Qwen3.6-35B-A3B, solo and concurrent, plus a Granite embedding sidecar	SGLang
`pipeline/`	The research path: serve → score → generate a synthetic corpus → LoRA fine-tune, over the public Chinook SQL agent	vLLM

Each model ships two shapes:

solo — one user driving a coding harness. The whole GPU and KV cache go to one session with a large context window.
concurrent — several agents sharing one GPU with fair-share scheduling and a smaller per-session window.
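
To make the trade-off between the two shapes concrete, here is a rough back-of-the-envelope sketch. It assumes the KV cache is split evenly across sessions, which is a simplification: real engines allocate cache dynamically, and the numbers below (`KV_BUDGET`, the session count) are hypothetical and not taken from this repo's deployments.

```python
def per_session_window(kv_budget_tokens: int, sessions: int, native_ctx: int) -> int:
    """Largest context window each session can hold under an even KV split.

    The window is capped by the model's native context length.
    """
    if sessions < 1:
        raise ValueError("sessions must be >= 1")
    return min(native_ctx, kv_budget_tokens // sessions)


# Hypothetical numbers, for illustration only.
KV_BUDGET = 400_000        # tokens of KV cache that fit on the GPU (assumed)
NATIVE_CTX = 256 * 1024    # a 256K native context window

solo = per_session_window(KV_BUDGET, sessions=1, native_ctx=NATIVE_CTX)
concurrent = per_session_window(KV_BUDGET, sessions=4, native_ctx=NATIVE_CTX)

print(f"solo window:       {solo} tokens")
print(f"concurrent window: {concurrent} tokens per session")
```

In the solo case the native context is the limit; in the concurrent case the shared cache is, which is why each session gets a smaller window.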

Models

Model	Repo	Arch	Native ctx	MTP drafter
Gemma 4 E2B-it	`google/gemma-4-E2B-it`	Dense+PLE, 5.1B/~2B	128K	yes (78M)
Gemma 4 E4B-it	`google/gemma-4-E4B-it`	Dense+PLE, 8B/4.5B	128K	yes (79M)
Gemma 4 12B-it	`google/gemma-4-12B-it`	Dense, ~12B	256K	none published
Gemma 4 26B-A4B-it	`google/gemma-4-26B-A4B-it`	MoE, 25B/3.8B active	256K	yes (0.4B)
Gemma 4 31B-it	`google/gemma-4-31B-it`	Dense, 31B	256K	yes (0.5B)
Qwen3.6-27B	`Qwen/Qwen3.6-27B`	Dense hybrid	256K	architectural
Qwen3.6-35B-A3B	`Qwen/Qwen3.6-35B-A3B`	MoE hybrid, 35B/3B active	256K	architectural

All Gemma 4 and Qwen3.6 weights are Apache-2.0 and ungated, so no Hugging Face token is needed to pull them.

Quick start (Modal)

You need uv and a Modal account. Pick a project and work from inside it.

cd gemma4                      # or qwen, or pipeline
uv sync                        # installs the local control plane (modal, openai)
uv run modal token new         # one-time Modal auth

# Deploy a model/shape. The app name is printed, along with a public URL.
uv run modal deploy deployments/12b/solo/serve.py

# Hit the endpoint Modal printed (OpenAI-compatible). Set URL to that address first.
curl "$URL/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma-4-12b-it","messages":[{"role":"user","content":"hi"}]}'

# Stop it when you're done so it stops billing.
uv run modal app stop gemma4-12b-solo

The endpoint is public the moment you deploy it. That is convenient for testing and a liability if you leave it up. How to put auth in front of it is your call — see docs/securing-endpoints.md.
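
The same request as the curl call in the quick start can be made from Python. This is a minimal sketch using only the standard library; the function names `build_chat_request` and `chat` are illustrative and not part of this repo, and the response parsing assumes the usual OpenAI-compatible chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumes the standard OpenAI-compatible response layout.
    return payload["choices"][0]["message"]["content"]
```

Call it with the URL Modal printed for your deployment, for example `chat(url, "gemma-4-12b-it", "hi")`.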

Chat templates

Each family ships the upstream template, a custom harness-friendly fork, a conformance suite, and a live probe you can point at your own endpoint.

Gemma 4: gemma4/chat_templates/ — the P1–P5 fork. See README and TESTING.
Qwen3.6: qwen/chat_templates/ — the Q1–Q8 fork. See README and TESTING.

The forks fix edge cases that bite agentic coding harnesses (opencode, pi, and similar): tool arguments arriving as JSON strings, reasoning getting dropped across multi-turn tool loops, thinking defaulting off, and a few template bugs. Each patch is gated, so rendering with the documented kwargs produces output that is byte-for-byte identical to upstream. The conformance suites enforce that guarantee.
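
To illustrate the first of those edge cases: a harness may send tool-call arguments either as an object or as a JSON-encoded string. The forks address this inside the chat template itself; the Python below is only a sketch of the idea, not the repo's code. The function name and the `_raw` fallback key are hypothetical.

```python
import json
from typing import Any


def normalize_tool_arguments(arguments: Any) -> dict:
    """Return tool-call arguments as a dict, whether they arrived as a dict or a JSON string."""
    if arguments is None:
        return {}
    if isinstance(arguments, dict):
        return arguments
    if isinstance(arguments, str):
        try:
            parsed = json.loads(arguments)
        except json.JSONDecodeError:
            # Not valid JSON: keep the original text rather than dropping it.
            return {"_raw": arguments}
        # Valid JSON that is not an object (a list, a number) is also preserved as-is.
        return parsed if isinstance(parsed, dict) else {"_raw": arguments}
    raise TypeError(f"unsupported tool arguments type: {type(arguments).__name__}")
```

Both `{"path": "a.py"}` and the string `'{"path": "a.py"}'` normalize to the same dict, so the rendered prompt does not depend on which form the harness sent.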

The research pipeline

pipeline/ is a worked example of taking a base model to a fine-tuned one, using a SQL agent over the public Chinook database as the task:

serve: run the model on vLLM,
eval: score it on a five-axis tool-use rubric,
corpus: generate a synthetic SFT corpus with a larger model as the teacher,
sft: LoRA fine-tune a small model and gate it against capability drift.
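
One way to picture the gate in the sft step: compare the fine-tuned model's rubric scores against the base model's and fail if any axis drops by more than a tolerance. This is a sketch under assumptions, not the pipeline's actual gate. The axis names, the 0-to-1 score scale, and the `max_drop` default are all hypothetical.

```python
def passes_drift_gate(
    baseline: dict[str, float],
    tuned: dict[str, float],
    max_drop: float = 0.02,
) -> tuple[bool, list[str]]:
    """Return (passed, regressed_axes) for a fine-tuned model versus its baseline.

    An axis regresses if its score is missing or falls more than `max_drop`
    below the baseline score.
    """
    regressions = []
    for axis, base_score in baseline.items():
        score = tuned.get(axis)
        if score is None or score < base_score - max_drop:
            regressions.append(axis)
    return (not regressions, regressions)


# Placeholder axis names; the rubric's real axes are defined in pipeline/.
baseline_scores = {f"axis_{i}": 0.80 for i in range(1, 6)}
tuned_scores = dict(baseline_scores, axis_2=0.70, axis_4=0.90)

passed, regressed = passes_drift_gate(baseline_scores, tuned_scores)
print(passed, regressed)
```

Gating per axis rather than on an average keeps a large gain on one axis from hiding a loss on another.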

Running on your own cloud

The serve scripts are Modal wrappers around python -m sglang.launch_server and vllm serve. The same image and the same argv run anywhere you have a GPU. docs/deploy-byo-cloud.md shows how.
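
As a rough picture of what such a wrapper assembles, the sketch below builds command lines for both engines. It is not the repo's actual argv: the helper names, host, and ports are assumptions, and only a few common flags are shown (`--model-path`, `--context-length` for SGLang; `--max-model-len` for vLLM). docs/deploy-byo-cloud.md is the authoritative source for the real arguments.

```python
from typing import Optional


def sglang_argv(model: str, host: str = "0.0.0.0", port: int = 8000,
                context_length: Optional[int] = None) -> list[str]:
    """Command line for an SGLang server (illustrative subset of flags)."""
    argv = ["python", "-m", "sglang.launch_server",
            "--model-path", model, "--host", host, "--port", str(port)]
    if context_length is not None:
        argv += ["--context-length", str(context_length)]
    return argv


def vllm_argv(model: str, host: str = "0.0.0.0", port: int = 8000,
              max_model_len: Optional[int] = None) -> list[str]:
    """Command line for a vLLM server (illustrative subset of flags)."""
    argv = ["vllm", "serve", model, "--host", host, "--port", str(port)]
    if max_model_len is not None:
        argv += ["--max-model-len", str(max_model_len)]
    return argv


print(" ".join(sglang_argv("google/gemma-4-12B-it", context_length=131072)))
print(" ".join(vllm_argv("Qwen/Qwen3.6-27B", max_model_len=32768)))
```

On a GPU host with the engine installed, the resulting list can be passed to `subprocess.run(argv, check=True)`.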

License

The code is licensed under Apache 2.0; see LICENSE. The model weights carry their own licenses (Apache-2.0 for the Gemma 4 and Qwen3.6 checkpoints used here).
