Helper scripts and examples for running vLLM on Stanford clusters (Sherlock, Marlowe, Yens) and for evaluating fine-tuned LoRA adapters trained on Together.
This repo accompanies the blog post Fine-Tuning Open Source Models with Together + vLLM.
It provides the “try it yourself” walkthrough: from preparing JSONL datasets to running base and fine-tuned models on Sherlock.
In this example, we fine-tune Qwen3-8B-Base to classify Reddit posts into one of ten subreddits. With no fine-tuning, the base model reached an accuracy of 0.39 on our test set. After fine-tuning with LoRA adapters, accuracy nearly doubled to 0.74.
We’ll walk step by step through:
- Preparing training data in JSONL format
- Submitting a fine-tuning job to Together
- Downloading the LoRA adapter
- Running both the base and fine-tuned models locally with vLLM
Our task is: given the title and body of a Reddit post, predict which subreddit it belongs to.
- Input: title + body
- Output: subreddit name (one of ten choices)
We prepared a dataset with:
- Training set: 9,800 examples per class
- Validation set: 100 examples per class
- Test set: 500 examples per class
Each row is stored as JSONL (JSON Lines). Together expects the following format:
{"prompt": "Post title\n\nPost body", "completion": "subreddit_name"}
This structure is all you need — one input string, one output string per line.
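For illustration, here is a minimal Python sketch of writing such a file; the posts and subreddit labels below are made up, not taken from our dataset:
import json

# Hypothetical rows; the real dataset pairs actual Reddit posts with their subreddits.
examples = [
    {"prompt": "What lens should I buy first?\n\nJust picked up a used DSLR and I'm overwhelmed by the options.",
     "completion": "photography"},
    {"prompt": "Sourdough starter not rising\n\nDay 6 and still no bubbles. What am I doing wrong?",
     "completion": "Breadit"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line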
On Sherlock, create a Python environment for data prep and training:
cd <project-space>/llm-ft
ml python/3.12.1
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
On the Yens, create a Python environment for data prep and training:
cd <project-space>/llm-ft
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
This ensures we can build JSONL files, interact with Together’s API, and later run inference locally.
With train.jsonl, val.jsonl, and test.jsonl ready, we upload the training and (optionally) validation sets:
together files upload train.jsonl
together files upload val.jsonl
You can launch the fine-tuning job via Together CLI or through the web interface.
The web interface makes it easy to adjust parameters, track progress, and view checkpoints.
Log in to Together and go to the Fine-tuning tab to start a new job.
First, select the base model. We chose Qwen/Qwen3-8B-Base.
You’ll be prompted to select parameters. For our experiment, we chose the following:
- Epochs: 1
- Checkpoints: 1
- Evaluations: 4
- Batch size: 8
- LoRA rank: 32
- LoRA alpha: 64
- LoRA dropout: 0.05
- LoRA trainable modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Train on inputs: false
- Learning rate: 0.0001
- Learning rate scheduler: cosine
- Warmup ratio: 0.06
- Scheduler cycles: 3
- Max gradient norm: 1
- Weight decay: 0.01
You can optionally connect to your Weights & Biases project to track training and validation losses graphically.
You can experiment with different values depending on your dataset size and task.
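If you’d rather script the job than click through the UI, the same settings can be passed to Together’s Python SDK. This is a hedged sketch rather than a verified command: parameter names can vary across SDK versions, so check Together’s fine-tuning docs, and the file IDs below are placeholders for the IDs returned by the upload step.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

job = client.fine_tuning.create(
    model="Qwen/Qwen3-8B-Base",
    training_file="file-xxxxxxxx",     # placeholder: ID returned for train.jsonl
    validation_file="file-yyyyyyyy",   # placeholder: ID returned for val.jsonl (optional)
    n_epochs=1,
    batch_size=8,
    learning_rate=1e-4,
    lora=True,
    lora_r=32,
    lora_alpha=64,
    lora_dropout=0.05,
)
print(job.id)  # job ID to track progress and download the adapter later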
When the job completes, you’ll be able to download the resulting LoRA adapter checkpoint (or the merged model). This adapter contains only the learned weights from fine-tuning — a lightweight file we’ll use in combination with the base model.
This training cost $5 and ran in 11 minutes.
After training finishes, download the LoRA adapter and copy it to your project space on Sherlock.
In this case, we made a models directory in our project space and copied the adapter to <project-space>/llm-ft/models/qwen3-8b-1epoch-10k-data-32-lora.
For inference, we will copy the adapter and unpack it on scratch.
On Sherlock:
export SCRATCH_BASE=$GROUP_SCRATCH/$USER
mkdir -p $SCRATCH_BASE/vllm/models
cp -r <project-space>/llm-ft/models/qwen3-8b-1epoch-10k-data-32-lora \
"$SCRATCH_BASE/vllm/models"
cd "$SCRATCH_BASE/vllm/models/qwen3-8b-1epoch-10k-data-32-lora"
tar --use-compress-program=unzstd -xvf ft-*.tar.zst -C .
On the Yens:
cp -r <project-space>/llm-ft/models/qwen3-8b-1epoch-10k-data-32-lora \
"$SCRATCH_BASE/vllm/models"
cd "$SCRATCH_BASE/vllm/models/qwen3-8b-1epoch-10k-data-32-lora"
tar --use-compress-program=unzstd -xvf ft-*.tar.zst -C .
Now the LoRA weights are unpacked and ready.
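As an optional sanity check, you can list the unpacked directory; vLLM expects a PEFT-style adapter, i.e. an adapter_config.json alongside the adapter weight file. A small sketch, assuming SCRATCH_BASE is exported as above:
import os

# Assumes SCRATCH_BASE is exported as above.
adapter_dir = os.path.expandvars(
    "$SCRATCH_BASE/vllm/models/qwen3-8b-1epoch-10k-data-32-lora"
)
print(sorted(os.listdir(adapter_dir)))  # look for adapter_config.json and the adapter weights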
With our adapter ready, we now need to launch a vLLM server on a GPU node. This will host the base model (and later the fine-tuned adapter) so we can run inference from a login node.
First, request a GPU node with enough memory for Qwen3-8B.
On Sherlock:
srun -p gpu -G 1 -C "GPU_MEM:80GB" -n 1 -c 16 --mem=50G -t 2:00:00 --pty /bin/bash
On the Yens:
srun -p gpu -G 1 -C "GPU_MODEL:A40" -n 1 -c 16 --mem=50G -t 2:00:00 --pty /bin/bash
Load the vLLM module:
On Sherlock:
ml py-vllm/0.7.0_py312
Clone the repo with a helper wrapper that launches vLLM on an available port:
git clone https://github.com/gsbdarc/vllm_helper.git
Make the environment:
cd vllm_helper
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Point the cache directories at scratch so large files don’t live in your home directory:
export SCRATCH_BASE=$GROUP_SCRATCH/$USER
export APPTAINER_CACHEDIR=$SCRATCH_BASE/.apptainer
Pull the latest vLLM container and source the wrapper script:
apptainer pull vllm-openai.sif docker://vllm/vllm-openai:latest
source vllm.sh
Export the base model and start the server:
export VLLM_MODEL=Qwen/Qwen3-8B-Base
vllm serve &
You’ll see output with the GPU hostname and port — this confirms the server is running.
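From any node that can reach the GPU node, you can confirm the server is serving the model by listing its models via the OpenAI-compatible API. The host and port below are placeholders for the values printed at startup:
import requests

# Placeholders: use the GPU hostname and port printed by the wrapper.
host, port = "gpu-node-hostname", 8000
resp = requests.get(f"http://{host}:{port}/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should include Qwen/Qwen3-8B-Base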
Once the server is up, we can run inference from a login node.
On the login node:
cd vllm_helper/example
ml python/3.12.1
source venv/bin/activate
pip install -r requirements.txt
export SCRATCH_BASE=$GROUP_SCRATCH/$USER
Run the baseline inference script on the test set:
python infer_base_8b.py
This will query the running vLLM server and evaluate predictions from the base model.
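Under the hood, the requests go to the server’s OpenAI-compatible API. Here is a minimal sketch of one such request (the host, port, and prompt are placeholders; the actual script iterates over the whole test set and computes accuracy):
from openai import OpenAI

# Placeholders: point base_url at the host and port printed when the server started.
client = OpenAI(base_url="http://gpu-node-hostname:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="Qwen/Qwen3-8B-Base",
    prompt="Post title\n\nPost body",  # placeholder; real prompts follow the training format
    max_tokens=8,
    temperature=0.0,
)
print(resp.choices[0].text.strip())  # predicted subreddit name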
- Final accuracy (base): 0.39 over 5,000 labeled examples
This is our baseline performance using the Qwen3-8B-Base model.
Now let’s load the LoRA adapter we downloaded from Together and repeat the experiment.
Stop and restart the vLLM server:
vllm stop
We will enable LoRA and point vLLM to the adapter path:
export VLLM_MODEL=Qwen/Qwen3-8B-Base
export VLLM_ENABLE_LORA=1
export VLLM_LORAS="reddit=/models/qwen3-8b-1epoch-10k-data-32-lora"
A few important notes here:
- The string before the = (reddit) is the adapter name.
- In our Python inference script, we refer to this adapter by name (reddit) when choosing which fine-tuned weights to apply; see the request sketch below. You can name it anything you like, but it must match between the environment variable and your code.
- The path after the = points to the directory where the LoRA adapter files are unpacked.
Relaunch the vLLM server on your GPU node:
vllm serve --max-lora-rank 32 &
By default, vLLM only allows LoRA adapters up to rank 16. Because we trained with LoRA rank 32, we need to override this limit by specifying --max-lora-rank 32. Without this flag, vLLM won’t load the adapter correctly.
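With the server back up, selecting the fine-tuned weights is just a matter of passing the adapter name as the model in the request. A minimal sketch (host, port, and prompt are placeholders):
from openai import OpenAI

# Placeholders for the GPU host and port, as before.
client = OpenAI(base_url="http://gpu-node-hostname:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="reddit",                    # the adapter name registered via VLLM_LORAS
    prompt="Post title\n\nPost body",  # placeholder prompt
    max_tokens=8,
    temperature=0.0,
)
print(resp.choices[0].text.strip())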
Then, from the login node, run the fine-tuned inference script on the test set:
python infer_ft_8b.py
- Final accuracy (adapter): 0.74 over 5,000 labeled examples
This shows the impact of fine-tuning: accuracy nearly doubled compared to the base model.