
jscott3201/llm-tuning


llm-tuning

Serving and fine-tuning code for the Google Gemma 4 and Qwen3.6 model families, built around solo and concurrent agentic-coding workloads. Everything here runs on Modal out of the box, but the serving layer is a thin wrapper over SGLang and vLLM, so you can run the same commands on any GPU host. See docs/deploy-byo-cloud.md.

Nothing here is hosted for you. You deploy it to your own account, and the endpoint URL is yours. There are no API keys, hostnames, or accounts baked into the code.

What's in here

The repo is three self-contained projects. Each has its own _common/ library, its own pyproject.toml, and its own deployments. Work from inside the one you need.

| Project | What it does | Engine |
|---|---|---|
| gemma4/ | Serve the Gemma 4 family (E2B, E4B, 12B, 26B-A4B, 31B), solo and concurrent, plus a Granite embedding sidecar | SGLang |
| qwen/ | Serve Qwen3.6-27B and Qwen3.6-35B-A3B, solo and concurrent, plus a Granite embedding sidecar | SGLang |
| pipeline/ | The research path: serve → score → generate a synthetic corpus → LoRA fine-tune, over the public Chinook SQL agent | vLLM |

Each model ships two shapes:

  • solo — one user driving a coding harness. The whole GPU and KV cache go to one session with a large context window.
  • concurrent — several agents sharing one GPU with fair-share scheduling and a smaller per-session window.
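As a back-of-envelope illustration of the trade-off between the two shapes, the sketch below splits a fixed KV-cache token budget evenly across sessions. The function name `per_session_window` and the budget figure are hypothetical, not values taken from this repo's deployments, and real engines allocate KV cache dynamically rather than in fixed slices.

```python
def per_session_window(kv_budget_tokens: int, sessions: int) -> int:
    """Evenly split a KV-cache token budget across concurrent sessions.

    Illustrative only: real schedulers allocate KV cache on demand.
    """
    if sessions < 1:
        raise ValueError("sessions must be >= 1")
    return kv_budget_tokens // sessions


# Hypothetical budget of 262,144 KV-cache tokens on one GPU.
BUDGET = 262_144

solo = per_session_window(BUDGET, 1)        # the whole budget goes to one session
concurrent = per_session_window(BUDGET, 8)  # eight agents share the same GPU

print(solo, concurrent)
```

With these made-up numbers, one solo session gets the full budget while each of eight concurrent sessions gets an eighth of it, which is why the concurrent shape advertises a smaller per-session window.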

Models

| Model | Repo | Arch | Native ctx | MTP drafter |
|---|---|---|---|---|
| Gemma 4 E2B-it | google/gemma-4-E2B-it | Dense+PLE, 5.1B/~2B | 128K | yes (78M) |
| Gemma 4 E4B-it | google/gemma-4-E4B-it | Dense+PLE, 8B/4.5B | 128K | yes (79M) |
| Gemma 4 12B-it | google/gemma-4-12B-it | Dense, ~12B | 256K | none published |
| Gemma 4 26B-A4B-it | google/gemma-4-26B-A4B-it | MoE, 25B/3.8B active | 256K | yes (0.4B) |
| Gemma 4 31B-it | google/gemma-4-31B-it | Dense, 31B | 256K | yes (0.5B) |
| Qwen3.6-27B | Qwen/Qwen3.6-27B | Dense hybrid | 256K | architectural |
| Qwen3.6-35B-A3B | Qwen/Qwen3.6-35B-A3B | MoE hybrid, 35B/3B active | 256K | architectural |

All Gemma 4 and Qwen3.6 weights are Apache-2.0 and ungated, so no Hugging Face token is needed to pull them.

Quick start (Modal)

You need uv and a Modal account. Pick a project and work from inside it.

cd gemma4                      # or qwen, or pipeline
uv sync                        # installs the local control plane (modal, openai)
uv run modal token new         # one-time Modal auth

# Deploy a model/shape. The app name is printed, along with a public URL.
uv run modal deploy deployments/12b/solo/serve.py

# Hit the endpoint Modal printed (OpenAI-compatible). Set URL to that address first.
curl "$URL/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma-4-12b-it","messages":[{"role":"user","content":"hi"}]}'

# Stop it when you're done so it stops billing.
uv run modal app stop gemma4-12b-solo

The endpoint is public the moment you deploy it. That is convenient for testing and a liability if you leave it up. How to put auth in front of it is your call — see docs/securing-endpoints.md.
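The curl call in the quick start can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the endpoint returns the usual OpenAI-style `choices[0].message.content` shape and that the `URL` environment variable holds the address Modal printed.

```python
import json
import os
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message content."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Only talks to the network when URL is set to your own deployment.
    url = os.environ.get("URL")
    if url:
        print(chat(url, "gemma-4-12b-it", "hi"))
```

Keeping request construction separate from sending makes it easy to add an auth header later, once you have put something in front of the endpoint.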

Chat templates

Each family ships the upstream template, a custom harness-friendly fork, a conformance suite, and a live probe you can point at your own endpoint.

The forks fix edge cases that bite agentic coding harnesses (opencode, pi, and similar): tool arguments arriving as JSON strings, reasoning getting dropped across multi-turn tool loops, thinking defaulting off, and a few template bugs. Each patch is gated, so rendering with the documented kwargs produces output byte-for-byte identical to upstream. The conformance suites lock that down.
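To illustrate the first edge case, here is a hedged Python sketch of normalizing tool-call arguments that arrive as a JSON string instead of an object. It is not the template code from the forks: the function `normalize_tool_arguments` and the `parse_string_args` gate are hypothetical names, used only to show the idea of a patch that can be switched off to get the input back unchanged.

```python
import json
from typing import Any


def normalize_tool_arguments(arguments: Any, parse_string_args: bool = True) -> Any:
    """Return tool-call arguments as a dict when they arrive as a JSON string.

    `parse_string_args` is a hypothetical gate: with it off, the input
    passes through untouched.
    """
    if not parse_string_args or not isinstance(arguments, str):
        return arguments
    try:
        parsed = json.loads(arguments)
    except json.JSONDecodeError:
        return arguments  # leave malformed strings alone rather than guess
    # Only accept JSON objects; anything else stays as the original string.
    return parsed if isinstance(parsed, dict) else arguments


raw = '{"path": "src/main.py", "line": 12}'
print(normalize_tool_arguments(raw))                                  # parsed dict
print(normalize_tool_arguments(raw, parse_string_args=False) == raw)  # gate off
```

With the gate off the function is an identity, which is the same property the conformance suites are described as checking for the template patches.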

The research pipeline

pipeline/ is a worked example of taking a base model to a fine-tuned one, using a SQL agent over the public Chinook database as the task:

  1. serve: run the model on vLLM,
  2. eval: score it on a five-axis tool-use rubric,
  3. corpus: generate a synthetic SFT corpus with a larger model as the teacher,
  4. sft: LoRA fine-tune a small model and gate it against capability drift.
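The "gate against capability drift" in step 4 can be pictured as a per-axis comparison between the base model and the fine-tuned one. The sketch below is an assumption about how such a gate might look, not the pipeline's actual implementation: the function `drift_gate`, the tolerance, the axis names, and all scores are made up for illustration.

```python
def drift_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return the axes on which the candidate regressed by more than `tolerance`."""
    missing = baseline.keys() - candidate.keys()
    if missing:
        raise KeyError(f"candidate is missing axes: {sorted(missing)}")
    return sorted(
        axis for axis, score in baseline.items()
        if candidate[axis] < score - tolerance
    )


# Placeholder axis names and scores; the real rubric's axes are not listed here.
base = {"axis_1": 0.90, "axis_2": 0.80, "axis_3": 0.75, "axis_4": 0.60, "axis_5": 0.85}
tuned = {"axis_1": 0.92, "axis_2": 0.70, "axis_3": 0.78, "axis_4": 0.61, "axis_5": 0.85}

regressed = drift_gate(base, tuned)
print("pass" if not regressed else f"fail: {regressed}")
```

An empty list means the fine-tuned model held its ground on every axis; any entry names an axis where it fell behind the baseline by more than the tolerance.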

Running on your own cloud

The serve scripts are Modal wrappers around python -m sglang.launch_server and vllm serve. The same image and the same argv run anywhere you have a GPU. docs/deploy-byo-cloud.md shows how.
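As a rough picture of what "the same argv" means, the sketch below builds command lines for the two entry points named above. The helpers `sglang_argv` and `vllm_argv` are hypothetical, and only a few common flags are shown (`--model-path`, `--host`, `--port`); the repo's real serve scripts likely pass more, so check them and docs/deploy-byo-cloud.md for the actual arguments.

```python
import shlex


def sglang_argv(model: str, port: int = 8000, host: str = "0.0.0.0") -> list[str]:
    """Argv for an SGLang server; the flag selection here is a minimal assumption."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model,
        "--host", host,
        "--port", str(port),
    ]


def vllm_argv(model: str, port: int = 8000, host: str = "0.0.0.0") -> list[str]:
    """Argv for a vLLM server; the model is a positional argument to `vllm serve`."""
    return ["vllm", "serve", model, "--host", host, "--port", str(port)]


# Model IDs come from the table in this README; the choice is illustrative.
print(shlex.join(sglang_argv("google/gemma-4-12B-it")))
print(shlex.join(vllm_argv("Qwen/Qwen3.6-27B", port=8001)))
```

Building the argv as a list keeps it usable both for printing a copy-pasteable command and for passing directly to `subprocess.run` on a GPU host.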

License

Apache-2.0. See LICENSE. The model weights carry their own licenses (Apache-2.0 for the Gemma 4 and Qwen3.6 checkpoints used here).
