This environment’s training signal is the same composite reward as evaluation: DuckDB execution (speedup + correctness), issue keywords, and light structure checks. There is no separate “training reward” that could diverge from deployment.
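Concretely, the reward has the shape of a weighted sum over those three components. The sketch below is illustrative only: the 0.6/0.3/0.1 split and the component logic are assumptions, and the real grading lives in the environment code.

```python
def composite_reward(speedup: float, rows_match: bool,
                     issues_hit: int, issues_total: int,
                     json_valid: bool) -> float:
    # Execution component: reward speedup only when the results are correct.
    r_exec = min(speedup, 10.0) / 10.0 if rows_match else 0.0
    # Keyword component: fraction of the planted issues the analysis names.
    r_keywords = issues_hit / max(issues_total, 1)
    # Structure component: well-formed action payload.
    r_structure = 1.0 if json_valid else 0.0
    # Hypothetical weights -- the env defines the real split.
    return 0.6 * r_exec + 0.3 * r_keywords + 0.1 * r_structure
```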
| Entry | Purpose |
|---|---|
| `train.py` | Custom GRPO-style loop (sketched below this table): sample task → generate a group of completions → score each with `env.step` / grade → advantage-normalize → policy update |
| `train.py --use-trl` | Optional path using Hugging Face TRL `GRPOTrainer` (requires `trl`; proper KL handling) |
| Kaggle notebook | Full 100-episode run with plots (linked from the README) |
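For orientation, here is a minimal sketch of the custom loop from the first row. It assumes a hypothetical `env.step(task, completion) -> float` grading call and a `task.prompt` attribute, omits KL regularization (the TRL path handles that) and padding masks for brevity, and is not the actual `train.py` implementation.

```python
import random
import torch

def grpo_step(model, tokenizer, env, tasks, optimizer,
              group_size=4, temperature=0.8, max_new_tokens=1024):
    # Sample one task uniformly, as the simple trainer does.
    task = random.choice(tasks)
    enc = tokenizer(task.prompt, return_tensors="pt").to(model.device)
    prompt_len = enc["input_ids"].shape[1]

    # Rollout: a group of G sampled completions for the same prompt.
    with torch.no_grad():
        out = model.generate(**enc, do_sample=True, temperature=temperature,
                             max_new_tokens=max_new_tokens,
                             num_return_sequences=group_size,
                             pad_token_id=tokenizer.eos_token_id)
    completions = tokenizer.batch_decode(out[:, prompt_len:],
                                         skip_special_tokens=True)

    # Score each completion with the composite reward.
    # env.step(task, text) -> float is a stand-in for the real grading call.
    rewards = torch.tensor([env.step(task, c) for c in completions],
                           dtype=torch.float32)

    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # REINFORCE-style update: weight each completion's log-prob by its
    # advantage. Padding tokens are not masked here, for brevity.
    optimizer.zero_grad()
    loss = torch.tensor(0.0, device=model.device)
    for seq, a in zip(out, adv):
        logits = model(seq.unsqueeze(0)).logits[:, :-1]
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, seq.unsqueeze(0)[:, 1:]
                                 .unsqueeze(-1)).squeeze(-1)
        # Keep only completion-token log-probs, not the prompt's.
        loss = loss - a.to(model.device) * token_logp[:, prompt_len - 1:].sum()
    (loss / group_size).backward()
    optimizer.step()
    return rewards.mean().item()
```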
| Field | Default | Notes |
|---|---|---|
| `model_name` | `Qwen/Qwen2.5-0.5B-Instruct` | Small model for free-tier GPUs |
| `num_episodes` | 200 | Full runs; reduce for smoke tests |
| `group_size` | 4 | GRPO group size (G) |
| `max_new_tokens` | 1024 | JSON action payload |
| `temperature` | 0.8 | Sampling during rollout |
| `learning_rate` | 1e-5 | AdamW |
| `output_dir` | `./checkpoints` | Model + `training_history.json` + optional `training_curves.png` |
Override by editing `TrainConfig` in `train.py` or extending the script (no CLI flags on the simple trainer today).
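As a reference, the fields in the table map onto a dataclass roughly like this sketch (the canonical `TrainConfig` in `train.py` is authoritative and may order or extend these fields differently):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Sketch matching the defaults table above.
    model_name: str = "Qwen/Qwen2.5-0.5B-Instruct"
    num_episodes: int = 200        # full runs; lower for smoke tests
    group_size: int = 4            # GRPO group size (G)
    max_new_tokens: int = 1024     # room for the JSON action payload
    temperature: float = 0.8       # rollout sampling
    learning_rate: float = 1e-5    # AdamW
    output_dir: str = "./checkpoints"
```

For a smoke test you would edit the defaults in place, or construct e.g. `TrainConfig(num_episodes=20)` where the script instantiates it.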
- CUDA: Recommended; `device_map="auto"` when available.
- CPU: Supported but slow; DuckDB warm-up + many forward passes dominate.
- Tasks: Fixed set in `tasks.py`; each episode samples uniformly unless you change `train.py`.
- Randomness: `random.choice` for the task id; `model.generate` uses sampling. Set seeds in PyTorch / CUDA / NumPy at the top of `train.py` if you need bitwise reproducibility for a paper run (see the snippet after this list).
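A seed-everything helper of the kind the Randomness bullet suggests; this is a sketch, and `set_seed` is not an existing function in this repo:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    # Cover every RNG the rollout touches: task sampling (random),
    # any NumPy use, and model.generate's sampling (torch / CUDA).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For bitwise reproducibility, also prefer deterministic kernels.
    torch.use_deterministic_algorithms(True, warn_only=True)
```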
Fine-tuned weights referenced in the README: `laterabhi/grpo-sql-optimizer`.
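Assuming those weights are a standard `transformers` causal-LM export (check the model card on the Hub), they should load the usual way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: standard Hub checkpoint; device_map="auto" needs `accelerate`.
model = AutoModelForCausalLM.from_pretrained("laterabhi/grpo-sql-optimizer",
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("laterabhi/grpo-sql-optimizer")
```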
```
python training/eval_before_after.py --save-dir results
```

Shows how much reward comes from actually running optimized SQL vs. analysis-only (see `results.md`).