Course: COMS E6998-012 High-Performance Machine Learning (Fall 2025)
This repository contains code and setup for benchmarking parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA against full fine-tuning baselines.
We evaluate trade-offs across GPU memory, latency, and accuracy using the Columbia Insomnia GPU Cluster.
- Team Name: HPML-PEFT
- Members:
  - Keemin Lee (kjl2175)
  - Sreeram Raghammudi (sr4314)
  - Aaryaman Bajaj (ab6105)
  - Aryaman Velampalli (akv2129)
  - Aravindan Jambunathan (aj3394)
Fine-tuning large transformer-based language models is computationally expensive in terms of GPU memory, training time, and energy consumption. This project investigates whether parameter-efficient fine-tuning (PEFT) techniques, specifically LoRA and QLoRA, can substantially reduce resource usage while maintaining competitive task accuracy. The objective is to characterize accuracy-efficiency trade-offs under constrained hardware settings using controlled, reproducible benchmarks.
- Base Model: DistilBERT (66.9M parameters)
- Framework: PyTorch + Hugging Face Transformers
- Fine-Tuning Variants:
- Full fine-tuning (all parameters trainable)
- LoRA: Low-rank adapters applied to query and value projection layers
- QLoRA: LoRA combined with 4-bit NF4 quantization via bitsandbytes
- Custom Modifications:
- Parameter freezing for backbone weights
- Rank sweep support (r ∈ {4, 8, 16, 32})
- Quantized base weights with BF16 compute for QLoRA
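Both variants reduce to a few lines of `peft` and `bitsandbytes` configuration. A minimal sketch of the idea, assuming illustrative hyperparameters (`lora_alpha`, `lora_dropout`) rather than the exact values used in our training scripts:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# LoRA: low-rank adapters on DistilBERT's query/value projections.
# alpha and dropout values here are illustrative assumptions.
lora_cfg = LoraConfig(
    r=8,                                 # rank, swept over {4, 8, 16, 32}
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],   # DistilBERT attention projections
    task_type="SEQ_CLS",
)

# QLoRA: quantize the frozen base weights to 4-bit NF4, compute in BF16.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    quantization_config=bnb_cfg,  # omit this argument for plain LoRA
)
model = get_peft_model(model, lora_cfg)  # freezes the backbone, adds adapters
model.print_trainable_parameters()
```

Wrapping the model with `get_peft_model` is what implements the backbone parameter freezing listed above: only the injected adapter weights remain trainable.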
| Metric | Value |
|---|---|
| Dataset | SST-2 (GLUE) |
| Best Baseline Accuracy | 91.86% |
| Best LoRA Accuracy | 87.96% (r=16/32) |
| Best QLoRA Accuracy | 88.19% (r=32) |
| Peak GPU Memory (Baseline) | ~1306 MB |
| Peak GPU Memory (QLoRA) | ~696 MB |
| Training Time/Epoch (Baseline) | ~171 s |
| Training Time/Epoch (LoRA) | ~48 s |
| Device | NVIDIA A100 |
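The peak-memory and time-per-epoch figures above are the kind of numbers standard PyTorch counters can capture; a minimal sketch of one way to measure them (the epoch body is a placeholder, and `pynvml` can be used instead for whole-device readings):

```python
import time
import torch

torch.cuda.reset_peak_memory_stats()  # clear previous peak counters
start = time.perf_counter()

# ... one training epoch runs here (placeholder) ...

torch.cuda.synchronize()              # wait for queued GPU work to finish
epoch_s = time.perf_counter() - start
peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"epoch time: {epoch_s:.1f} s, peak GPU memory: {peak_mb:.0f} MB")
```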
Training and evaluation metrics are logged to Weights & Biases:
- https://wandb.ai/keemin/huggingface?nw=nwuserkeeminlee
- https://wandb.ai/akv2129-columbia-university/peft_benchmark_final?nw=nwuserakv2129
This repository supports training-only benchmarking of fine-tuning strategies. Inference is not the primary focus and is limited to validation-time evaluation during training.
hpml-peft-benchmark/
├── scripts/     # Training, profiling, and evaluation scripts
├── slurm/       # SLURM job submission scripts
├── docs/        # Environment and setup documentation
├── env/         # Environment specs (requirements.txt, YAML)
├── reports/     # Experimental results and analysis
└── notebooks/   # Optional interactive analysis notebooks
Note: data/, outputs/, and logs/ are stored externally on cluster scratch space.
module load anaconda/2023.09
eval "$(/insomnia001/shared/apps/anaconda/2023.09/bin/conda shell.bash hook)"conda create -p $HOME/.conda/envs/peft_benchmark python=3.10 -y
conda activate $HOME/.conda/envs/peft_benchmarkpip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
pip install transformers datasets peft bitsandbytes accelerate deepspeed wandb pynvmlAdd this to your ~/.bashrc:
eval "$(/insomnia001/shared/apps/anaconda/2023.09/bin/conda shell.bash hook)"
export CONDA_ENVS_DIRS="$HOME/.conda/envs"
export CONDA_PKGS_DIRS="$HOME/.conda/pkgs"Keemin's scratch space (serving as shared data/cache source):
/insomnia001/depts/edu/COMS-E6998-012/kjl2175/
├── code/     ← cloned GitHub repo
├── data/     ← read-only shared datasets
├── outputs/  ← personal model checkpoints (per-user)
├── logs/     ← SLURM + training logs (per-user)
└── cache/    ← shared Hugging Face cache
Each teammate keeps their own outputs/logs, while reading shared data and cache from Keemin's scratch.
MY_SCR=/insomnia001/depts/edu/COMS-E6998-012/<your-UNI>
mkdir -p $MY_SCR/{outputs,logs}

All scripts share the same CLI. Replace OUT/LOG roots with your scratch paths.
# Baseline fine-tuning
python scripts/train_baseline.py --outdir OUT --logdir LOG --report_to none
# LoRA (rank sweep supported via --rank)
python scripts/train_lora.py --outdir OUT --logdir LOG --report_to none --rank 8
# QLoRA (NF4 4-bit quantization)
python scripts/train_qlora.py --outdir OUT --logdir LOG --report_to none --rank 8

Each run creates <outdir>/<run_id>/ containing env.json, metrics.csv, summary.json, summary.csv, and a checkpoint/ (baseline) or adapter/ (LoRA/QLoRA) directory.
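To inspect a single run programmatically, the per-run JSON can be loaded directly. A minimal sketch, assuming illustrative field names (the actual summary.json schema is not documented here):

```python
import json
from pathlib import Path

run_dir = Path("outputs") / "<run_id>"  # replace with an actual run id
summary = json.loads((run_dir / "summary.json").read_text())

# Field names below are illustrative assumptions, not a documented schema.
print(summary.get("accuracy"), summary.get("peak_mem_mb"), summary.get("epoch_s"))
```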
srun --pty -t 0-01:00 --gres=gpu:1 -A edu /bin/bash
module load anaconda/2023.09
eval "$(/insomnia001/shared/apps/anaconda/2023.09/bin/conda shell.bash hook)"
conda activate $HOME/.conda/envs/peft_benchmark
python scripts/test_gpu.py

Example SLURM file slurm/train_baseline.slurm:
#!/bin/bash
#SBATCH --job-name=baseline_sst2
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --account=edu
#SBATCH --output=/insomnia001/depts/edu/COMS-E6998-012/<your-UNI>/logs/%x-%j.out
module load anaconda/2023.09
eval "$(/insomnia001/shared/apps/anaconda/2023.09/bin/conda shell.bash hook)"
conda activate $HOME/.conda/envs/peft_benchmark
export HF_HOME=/insomnia001/depts/edu/COMS-E6998-012/kjl2175/cache/hf
export TRANSFORMERS_CACHE=$HF_HOME
python scripts/train_baseline.py \
--outdir /insomnia001/depts/edu/COMS-E6998-012/<your-UNI>/outputs/${USER} \
--logdir /insomnia001/depts/edu/COMS-E6998-012/<your-UNI>/logs/${USER} \
  --report_to none

Submit and monitor:
cd slurm
sbatch train_baseline.slurm
squeue -u $USER
tail -f /insomnia001/depts/edu/COMS-E6998-012/<your-UNI>/logs/baseline_sst2-<JOBID>.out

ACLs are not supported on this filesystem, so direct shared writing is unavailable.
Instead, follow this model:
| Path | Access | Description |
|---|---|---|
| /insomnia.../kjl2175/data/ | Read-only | Shared datasets for all |
| /insomnia.../kjl2175/cache/hf/ | Read-only | Shared Hugging Face model cache |
| /insomnia.../<UNI>/outputs/ | Read/Write (owner only) | Each teammate's training outputs |
| /insomnia.../<UNI>/logs/ | Read/Write (owner only) | Job logs per teammate |
chmod -R a+rX /insomnia001/depts/edu/COMS-E6998-012/kjl2175/data
chmod -R a+rX /insomnia001/depts/edu/COMS-E6998-012/kjl2175/cache

- Clone the repo

  git clone [email protected]:keeminlee/hpml-peft-benchmark.git

- Create environment

  conda create -p $HOME/.conda/envs/peft_benchmark python=3.10 -y
  conda activate $HOME/.conda/envs/peft_benchmark
  pip install -r env/requirements.txt

- Set shared paths

  export HF_HOME=/insomnia001/depts/edu/COMS-E6998-012/kjl2175/cache/hf
  export TRANSFORMERS_CACHE=$HF_HOME
  DATA_DIR=/insomnia001/depts/edu/COMS-E6998-012/kjl2175/data

- Run jobs using your own scratch outputs/logs

  MY_SCR=/insomnia001/depts/edu/COMS-E6998-012/<your-UNI>
  python scripts/train_baseline.py --outdir $MY_SCR/outputs/${USER} --logdir $MY_SCR/logs/${USER} --report_to none
After running experiments, aggregate every summary.json under outputs/ into a single CSV:
python scripts/collect_results.py --root outputs --out reports/results.csv
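The aggregation amounts to walking the output tree and flattening every per-run summary into one table. A minimal sketch of that logic, assuming each summary.json is a flat dict (an assumption about the schema) and the paths from the command above:

```python
import csv
import json
from pathlib import Path

# Gather every per-run summary under the outputs root.
rows = [json.loads(p.read_text()) for p in Path("outputs").rglob("summary.json")]

# Take the union of keys so runs with extra fields still fit one header.
fields = sorted({k for row in rows for k in row})
with open("reports/results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)  # missing keys are left as empty cells
```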
Generate quick plots (saved to reports/figures/):

python reports/summary.py

The benchmark compares accuracy vs. memory vs. throughput across full fine-tuning, LoRA, and QLoRA (r ∈ {4, 8, 16, 32}). Higher throughput and lower VRAM typically come with some accuracy trade-off; the unified summaries make these trade-offs easy to inspect.
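A sketch of the kind of trade-off plot reports/summary.py produces, assuming illustrative column names (method, peak_mem_mb, accuracy) rather than a documented results.csv schema:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("reports/results.csv")

# Column names below are illustrative assumptions about results.csv.
fig, ax = plt.subplots()
for method, grp in df.groupby("method"):
    ax.scatter(grp["peak_mem_mb"], grp["accuracy"], label=method)
ax.set_xlabel("Peak GPU memory (MB)")
ax.set_ylabel("Validation accuracy")
ax.legend()
fig.savefig("reports/figures/accuracy_vs_memory.png", dpi=150)
```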
- SSH access to cluster confirmed
- Repo cloned from GitHub
- Conda environment created
- Shared data/cache accessible (read-only)
- Jobs writing to per-user scratch directories
- Baseline SLURM job runs successfully