Fix LoRA model training #1385
Code Review
This pull request introduces a mechanism to unload stale LoRA adapter versions from the inference server, preventing VRAM accumulation and potential hangs during training. It adds a lora_keep_versions parameter to track the retention window, supports best-effort HTTP requests for cleanup, and propagates the LoRA adapter name through the OpenAI client and rollout configuration. Feedback suggests automatically syncing rollout.use_lora from actor.use_lora to avoid configuration mismatches, and catching Exception instead of BaseException when ignoring best-effort request failures to prevent catching system-exiting exceptions.
    if self.rollout.use_lora and not self.rollout.lora_name:
        self.rollout.lora_name = self.gconfig.lora_name
To prevent configuration mismatch bugs where actor.use_lora is enabled but rollout.use_lora is left disabled (which would cause rollout to run without LoRA), we should automatically sync rollout.use_lora from actor.use_lora during initialization.
-   if self.rollout.use_lora and not self.rollout.lora_name:
-       self.rollout.lora_name = self.gconfig.lora_name
+   if self.actor.use_lora:
+       self.rollout.use_lora = True
+   if self.rollout.use_lora and not self.rollout.lora_name:
+       self.rollout.lora_name = self.gconfig.lora_name
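The suggested sync can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical, simplified stand-ins for the real config dataclasses (only the fields named in this thread are modeled), so treat this as an illustration of the ordering of the two checks rather than the actual `PPOConfig`.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins: only the fields discussed in this review are modeled.
@dataclass
class ActorConfigSketch:
    use_lora: bool = False


@dataclass
class RolloutConfigSketch:
    use_lora: bool = False
    lora_name: str = ""


@dataclass
class GenConfigSketch:
    lora_name: str = "default_lora"


@dataclass
class PPOConfigSketch:
    actor: ActorConfigSketch = field(default_factory=ActorConfigSketch)
    rollout: RolloutConfigSketch = field(default_factory=RolloutConfigSketch)
    gconfig: GenConfigSketch = field(default_factory=GenConfigSketch)

    def __post_init__(self) -> None:
        # Step 1: the actor's LoRA flag wins, so rollout cannot silently run without LoRA.
        if self.actor.use_lora:
            self.rollout.use_lora = True
        # Step 2: fill the adapter name from gconfig only when rollout has none of its own.
        if self.rollout.use_lora and not self.rollout.lora_name:
            self.rollout.lora_name = self.gconfig.lora_name
```

The order matters: syncing `use_lora` first means the name back-fill also runs when only the actor flag was set.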
        default=False,
        metadata={"help": "Whether to use LoRA. Should be same as actors LORA option."},
    )
    lora_name: str = field(
Please run docs/generate_cli_docs.py to update both the en and zh versions of cli_reference.md at the same time, and add them to this PR.
ArealOpenAI rebuilt GenerationHyperparameters without lora_name, so /generate
requests always used the dataclass default "default_lora" while the trainer
loaded the configured adapter (e.g. "lora-gsm8k-v0"). SGLang then rejected
every rollout with "LoRA adapter that has never been loaded: default_lora-v0".
Thread the configured adapter name through the request side:
- Add InferenceEngineConfig.lora_name; PPOConfig.__post_init__ syncs it from
  gconfig.lora_name when rollout.use_lora (single source of truth).
- ArealOpenAI and its AsyncCompletionsWithReward / AsyncResponsesWithReward
  resources accept lora_name and set it on the rebuilt gconfig.
- proxy_rollout_server._setup_openai_client passes config.lora_name.
Load and request sides now agree on '<lora_name>-vN'.
Each train step loaded a new versioned adapter (lora-<name>-v{N}) via
/load_lora_adapter but never unloaded older versions, so sglang accumulated
one adapter per step. VRAM crept up until sglang hung inside
update_weights_from_disk -- on the single 24GB GPU with enable_offload=false
(actor + sglang co-resident) it stalled at step 39 (step 38 update_weights was
10.3s, step 39 hit the 600s read timeout), well below the memory ceiling.
Unload the version that has fallen outside the retention window
(max_head_offpolicyness + 2, enough to cover off-policy rollouts) as a
best-effort request, logged-and-ignored on failure so it never breaks the
weight update.
- Add WeightUpdateMeta.lora_keep_versions and HttpRequest.best_effort
- SGLangBackend appends a best-effort /unload_lora_adapter for v{N-keep}
- disk-update executor ignores best-effort request failures
- rl_trainer derives lora_keep_versions from rollout.max_head_offpolicyness
Restored from standalone commit 386328b9 (lost in the controller v2 refactor).
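The request-builder behaviour described in this commit message can be sketched as follows. This is an illustrative reconstruction, not the code in sglang_remote.py: `HttpRequestSketch`, `build_lora_update_requests`, and the payload keys are assumptions made for the example; only the endpoint names, the `v{N-keep}` rule, and the `max_head_offpolicyness + 2` window come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HttpRequestSketch:
    # Hypothetical request record; the real HttpRequest has more fields.
    endpoint: str
    payload: Dict[str, str] = field(default_factory=dict)
    best_effort: bool = False


def keep_versions_from(max_head_offpolicyness: int) -> int:
    # Retention window stated in the commit message: off-policyness plus 2.
    return max_head_offpolicyness + 2


def build_lora_update_requests(
    lora_name: str, version: int, path: str, keep_versions: int
) -> List[HttpRequestSketch]:
    # Always load the new versioned adapter first.
    requests = [
        HttpRequestSketch(
            "/load_lora_adapter",
            # Payload keys are an assumption for this sketch.
            {"lora_name": f"{lora_name}-v{version}", "lora_path": path},
        )
    ]
    stale = version - keep_versions
    # keep_versions == 0 keeps the old load-only behaviour; negative
    # versions never existed, so nothing is unloaded inside the window.
    if keep_versions > 0 and stale >= 0:
        requests.append(
            HttpRequestSketch(
                "/unload_lora_adapter",
                {"lora_name": f"{lora_name}-v{stale}"},
                best_effort=True,
            )
        )
    return requests
```

With `max_head_offpolicyness=2` the window is 4, so loading version 5 also requests an unload of version 1, while loading version 3 requests no unload.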
Add the lora_name InferenceEngineConfig entry to the EN/ZH CLI reference tables, generated from the new config field. The adapter name lets generation requests select the served LoRA adapter and is auto-filled from gconfig.lora_name by PPOConfig.__post_init__ so the load and request sides stay in sync.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Description
Two independent but related bug fixes that unblock LoRA RL training on the
SGLang backend. Both surface on a single 24GB GPU with enable_offload=false
(actor + sglang co-resident), where the adapter load/request lifecycle is most
stressed.
1. Unload stale LoRA adapters on disk weight update (61d1884a)
Every train step loaded a new versioned adapter lora-<name>-v{N} via
/load_lora_adapter but never unloaded older versions, so sglang accumulated
one adapter per step. VRAM crept up until sglang hung inside
update_weights_from_disk; it was observed stalling at step 39 (step 38's
update took 10.3s; step 39 hit the 600s read timeout), well below the memory
ceiling.
Fix: emit a best-effort /unload_lora_adapter for the version that has fallen
outside the retention window (max_head_offpolicyness + 2, enough to cover
in-flight off-policy rollouts). The unload is logged-and-ignored on failure so
it can never break a weight update.
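The "logged-and-ignored" handling can be sketched as below. This is a hedged illustration, not the executor in remote_inf_engine.py: `run_requests` and the `(send, best_effort)` pair shape are hypothetical. It follows the review feedback by catching `Exception` rather than `BaseException`, so `KeyboardInterrupt` and `SystemExit` still propagate.

```python
import logging
from typing import Callable, Iterable, Tuple

logger = logging.getLogger("lora_unload_sketch")


def run_requests(requests: Iterable[Tuple[Callable[[], object], bool]]) -> int:
    """Run (send, best_effort) pairs in order and return how many succeeded.

    A failing best-effort request is logged and skipped; any other failure
    is re-raised so a broken weight update is never hidden.
    """
    succeeded = 0
    for send, best_effort in requests:
        try:
            send()
            succeeded += 1
        except Exception as exc:  # deliberately not BaseException
            if not best_effort:
                raise
            logger.warning("Ignoring best-effort request failure: %s", exc)
    return succeeded
```

Re-raising for non-best-effort requests keeps the load path strict, which is the property the fix relies on: only the unload is allowed to fail quietly.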
2. Thread lora_name through to generation requests (67f7a4a0)
ArealOpenAI rebuilt GenerationHyperparameters without lora_name, so /generate
always used the dataclass default "default_lora" while the trainer loaded the
configured adapter. SGLang then rejected every rollout with
"LoRA adapter that has never been loaded: default_lora-v0".
Fix: thread the configured adapter name through the request side, with
PPOConfig.__post_init__ syncing it from gconfig.lora_name as the single source
of truth.
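The failure mode and the fix can be reproduced in miniature. The sketch below uses a hypothetical `GenParamsSketch` dataclass in place of GenerationHyperparameters, and `rebuild_gconfig` / `adapter_for_request` are illustrative helpers, not functions from the client; the point is only that a rebuilt config falls back to the dataclass default unless the adapter name is passed along explicitly.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class GenParamsSketch:
    # Hypothetical subset of generation parameters.
    max_new_tokens: int = 512
    temperature: float = 1.0
    lora_name: str = "default_lora"


def rebuild_gconfig(
    base: GenParamsSketch, lora_name: Optional[str] = None, **overrides
) -> GenParamsSketch:
    # Carry the configured adapter name into the rebuilt config; without it
    # the dataclass default is what ends up in the request.
    if lora_name:
        overrides["lora_name"] = lora_name
    return replace(base, **overrides)


def adapter_for_request(gconfig: GenParamsSketch, version: int) -> str:
    # Versioned adapter name in the '<lora_name>-vN' form used by the load side.
    return f"{gconfig.lora_name}-v{version}"
```

Rebuilding without a name yields `default_lora-v0`, the adapter SGLang reported as never loaded; passing `lora_name="lora-gsm8k"` yields `lora-gsm8k-v0`, matching what the trainer loads.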
Net effect: the load side and the request side now agree on <lora_name>-vN,
and stale adapters are reclaimed so long runs no longer hang.
Related Issue
N/A — discovered while running LoRA GRPO on a single-GPU colocated setup.
(Replace with Fixes #<id> if a tracking issue exists.)
Type of Change
Checklist
pre-commit run --all-files
main
/review-pr command
/create-pr
Breaking Change Details (if applicable): N/A; all changes are additive and
backward-compatible.
Additional Context
Scope: 2 commits, 8 files, +133/−1, including a new 65-line unit test.
New tests: tests/test_sglang_lora_unload.py, CPU-only (asserts on the
request builder; no GPU / sglang server required):
$ pytest tests/test_sglang_lora_unload.py -q
5 passed in 4.45s
Covers: stale-version unload beyond the retention window; no unload within the
window; lora_keep_versions=0 preserving the old load-only behaviour; the
unload being best-effort; and the non-LoRA disk path staying untouched.
Backward compatibility (non-breaking):
- WeightUpdateMeta.lora_keep_versions, HttpRequest.best_effort, and
  InferenceEngineConfig.lora_name are additive fields with defaults; existing
  configs and the full-model weight-update path are unchanged.
- lora_keep_versions=0 reproduces the exact prior behaviour (load only).
Note: commit 1 restores logic from standalone commit 386328b9 that was
lost in the controller-v2 refactor.
Key files:
- areal/api/io_struct.py: WeightUpdateMeta.lora_keep_versions, HttpRequest.best_effort
- areal/engine/sglang_remote.py: best-effort /unload_lora_adapter in the disk-update request builder
- areal/infra/remote_inf_engine.py: disk-update executor ignores best-effort failures
- areal/trainer/rl_trainer.py: derives lora_keep_versions from rollout.max_head_offpolicyness
- areal/api/cli_args.py: InferenceEngineConfig.lora_name + PPOConfig.__post_init__ sync
- areal/experimental/openai/{client.py, proxy/proxy_rollout_server.py}: pass lora_name on the request side