Fix LoRA model training by lifeiteng · Pull Request #1385 · areal-project/AReaL

lifeiteng · 2026-06-03T06:18:14Z

Description

Two independent but related bug fixes that unblock LoRA RL training on the
SGLang backend. Both surface on a single 24GB GPU with enable_offload=false
(actor + sglang co-resident), where the adapter load/request lifecycle is most
stressed.

1. Unload stale LoRA adapters on disk weight update (61d1884a)
Every train step loaded a new versioned adapter lora-<name>-v{N} via
/load_lora_adapter but never unloaded older versions, so sglang accumulated
one adapter per step. VRAM crept up until sglang hung inside
update_weights_from_disk — observed stalling at step 39 (step 38's update was
10.3s; step 39 hit the 600s read timeout), well below the memory ceiling.
Fix: emit a best-effort /unload_lora_adapter for the version that has
fallen outside the retention window (max_head_offpolicyness + 2, enough to
cover in-flight off-policy rollouts). The unload is logged-and-ignored on
failure so it can never break a weight update.
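The retention logic described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `HttpRequest` fields other than `best_effort`, the helper names `retention_window` and `build_lora_update_requests`, and the payload keys are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HttpRequest:
    # Simplified stand-in; only `best_effort` is named in the PR description.
    endpoint: str
    payload: dict[str, Any] = field(default_factory=dict)
    best_effort: bool = False


def retention_window(max_head_offpolicyness: int) -> int:
    # Retention window as stated in the PR: max_head_offpolicyness + 2.
    return max_head_offpolicyness + 2


def build_lora_update_requests(
    lora_name: str, version: int, lora_path: str, lora_keep_versions: int
) -> list[HttpRequest]:
    # Hypothetical request builder: always load v{N}; additionally unload
    # v{N - keep} once that version exists and retention is enabled.
    requests = [
        HttpRequest(
            "/load_lora_adapter",
            {"lora_name": f"{lora_name}-v{version}", "lora_path": lora_path},
        )
    ]
    stale = version - lora_keep_versions
    if lora_keep_versions > 0 and stale >= 0:
        requests.append(
            HttpRequest(
                "/unload_lora_adapter",
                {"lora_name": f"{lora_name}-v{stale}"},
                best_effort=True,  # a failed unload must not fail the update
            )
        )
    return requests
```

With `lora_keep_versions=0` the builder emits only the load request, which matches the load-only behaviour the PR describes as the backward-compatible default.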

2. Thread lora_name through to generation requests (67f7a4a0)
ArealOpenAI rebuilt GenerationHyperparameters without lora_name, so
/generate always used the dataclass default default_lora while the trainer
loaded the configured adapter. SGLang then rejected every rollout with
LoRA adapter that has never been loaded: default_lora-v0.
Fix: thread the configured adapter name through the request side, with
PPOConfig.__post_init__ syncing it from gconfig.lora_name as the single
source of truth.
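A rough sketch of the sync, using simplified stand-ins for the config classes. The field defaults shown here (in particular the empty-string default for `InferenceEngineConfig.lora_name`) are assumptions for illustration; only the two-line `__post_init__` body is taken from the diff under review.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationHyperparameters:
    # "default_lora" is the dataclass default mentioned in the PR description.
    lora_name: str = "default_lora"


@dataclass
class InferenceEngineConfig:
    use_lora: bool = False
    lora_name: str = ""  # assumed default: empty means "not set"


@dataclass
class PPOConfig:
    gconfig: GenerationHyperparameters = field(
        default_factory=GenerationHyperparameters
    )
    rollout: InferenceEngineConfig = field(default_factory=InferenceEngineConfig)

    def __post_init__(self) -> None:
        # gconfig.lora_name is the single source of truth; fill the rollout
        # side only when LoRA is enabled and no name was set explicitly.
        if self.rollout.use_lora and not self.rollout.lora_name:
            self.rollout.lora_name = self.gconfig.lora_name
```

An explicitly configured `rollout.lora_name` is left untouched, so the sync only fills a gap and never overrides a user-supplied value.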

Net effect: the load side and the request side now agree on <lora_name>-vN,
and stale adapters are reclaimed so long runs no longer hang.

Related Issue

N/A — discovered while running LoRA GRPO on a single-GPU colocated setup.
(Replace with Fixes #<id> if a tracking issue exists.)

Type of Change

Checklist

I have read the Contributing Guide
Pre-commit hooks pass (pre-commit run --all-files)
Relevant tests pass; new tests added for new functionality
Documentation updated (N/A — internal bug fix, no user-facing docs)
Branch is up to date with main
Self-reviewed via /review-pr command
This PR was created by a coding agent via /create-pr
This PR is a breaking change

Breaking Change Details (if applicable): N/A — all changes are additive,
backward-compatible.

Additional Context

Scope: 2 commits, 8 files, +133/−1, including a new 65-line unit test.

New tests — tests/test_sglang_lora_unload.py, CPU-only (assert on the
request builder; no GPU / sglang server required):

$ pytest tests/test_sglang_lora_unload.py -q
5 passed in 4.45s

Covers: stale-version unload beyond the retention window; no unload within the
window; lora_keep_versions=0 preserving the old load-only behaviour; the
unload being best-effort; and the non-LoRA disk path staying untouched.

Backward compatibility (non-breaking):

WeightUpdateMeta.lora_keep_versions, HttpRequest.best_effort, and
InferenceEngineConfig.lora_name are additive fields with defaults; existing
configs and the full-model weight-update path are unchanged.
lora_keep_versions=0 reproduces the exact prior behaviour (load only).

Note: commit 1 restores logic from standalone commit 386328b9 that was
lost in the controller-v2 refactor.

Key files:

areal/api/io_struct.py — WeightUpdateMeta.lora_keep_versions, HttpRequest.best_effort
areal/engine/sglang_remote.py — best-effort /unload_lora_adapter in the disk-update request builder
areal/infra/remote_inf_engine.py — disk-update executor ignores best-effort failures
areal/trainer/rl_trainer.py — derives lora_keep_versions from rollout.max_head_offpolicyness
areal/api/cli_args.py — InferenceEngineConfig.lora_name + PPOConfig.__post_init__ sync
areal/experimental/openai/{client.py, proxy/proxy_rollout_server.py} — pass lora_name on the request side

gemini-code-assist

Code Review

This pull request introduces a mechanism to unload stale LoRA adapter versions from the inference server, preventing VRAM accumulation and potential hangs during training. It adds a lora_keep_versions parameter to track the retention window, supports best-effort HTTP requests for cleanup, and propagates the LoRA adapter name through the OpenAI client and rollout configuration. Feedback suggests automatically syncing rollout.use_lora from actor.use_lora to avoid configuration mismatches, and catching Exception instead of BaseException when ignoring best-effort request failures to prevent catching system-exiting exceptions.
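The `Exception` versus `BaseException` point can be illustrated with a small sketch. The `run_requests` helper and its `(send, best_effort)` pairs are hypothetical and do not reflect the executor in `areal/infra/remote_inf_engine.py`; the sketch only shows why the narrower `except` matters.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def run_requests(requests: Iterable[tuple[Callable[[], None], bool]]) -> int:
    """Run each request; return how many best-effort failures were ignored."""
    ignored = 0
    for send, best_effort in requests:
        try:
            send()
        except Exception as exc:
            # KeyboardInterrupt and SystemExit derive from BaseException,
            # not Exception, so they still propagate from this handler.
            if not best_effort:
                raise
            logger.warning("Ignoring best-effort request failure: %s", exc)
            ignored += 1
    return ignored
```

Catching `BaseException` here would also swallow Ctrl-C and interpreter shutdown during a best-effort unload, which is the behaviour the review comment warns against.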


gemini-code-assist · 2026-06-03T06:20:37Z

+        if self.rollout.use_lora and not self.rollout.lora_name:
+            self.rollout.lora_name = self.gconfig.lora_name


To prevent configuration mismatch bugs where actor.use_lora is enabled but rollout.use_lora is left disabled (which would cause rollout to run without LoRA), we should automatically sync rollout.use_lora from actor.use_lora during initialization.

Suggested change

-        if self.rollout.use_lora and not self.rollout.lora_name:
-            self.rollout.lora_name = self.gconfig.lora_name
+        if self.actor.use_lora:
+            self.rollout.use_lora = True
+        if self.rollout.use_lora and not self.rollout.lora_name:
+            self.rollout.lora_name = self.gconfig.lora_name

sitabulaixizawaluduo · 2026-06-04T03:11:55Z

        default=False,
        metadata={"help": "Whether to use LoRA. Should be same as actors LORA option."},
    )
+    lora_name: str = field(


Please run docs/generate_cli_docs.py to update both the en and zh versions of cli_reference.md, and add the regenerated files to this PR.

ArealOpenAI rebuilt GenerationHyperparameters without lora_name, so /generate requests always used the dataclass default "default_lora" while the trainer loaded the configured adapter (e.g. "lora-gsm8k-v0"). SGLang then rejected every rollout with "LoRA adapter that has never been loaded: default_lora-v0".

Thread the configured adapter name through the request side:

- Add InferenceEngineConfig.lora_name; PPOConfig.__post_init__ syncs it from gconfig.lora_name when rollout.use_lora is set (single source of truth).
- ArealOpenAI and its AsyncCompletionsWithReward / AsyncResponsesWithReward resources accept lora_name and set it on the rebuilt gconfig.
- proxy_rollout_server._setup_openai_client passes config.lora_name.

Load and request sides now agree on '<lora_name>-vN'.

Each train step loaded a new versioned adapter (lora-<name>-v{N}) via /load_lora_adapter but never unloaded older versions, so sglang accumulated one adapter per step. VRAM crept up until sglang hung inside update_weights_from_disk -- on the single 24GB GPU with enable_offload=false (actor + sglang co-resident) it stalled at step 39 (step 38 update_weights was 10.3s, step 39 hit the 600s read timeout), well below the memory ceiling.

Unload the version that has fallen outside the retention window (max_head_offpolicyness + 2, enough to cover off-policy rollouts) as a best-effort request, logged-and-ignored on failure so it never breaks the weight update.

- Add WeightUpdateMeta.lora_keep_versions and HttpRequest.best_effort
- SGLangBackend appends a best-effort /unload_lora_adapter for v{N-keep}
- disk-update executor ignores best-effort request failures
- rl_trainer derives lora_keep_versions from rollout.max_head_offpolicyness

Restored from standalone commit 386328b9 (lost in the controller v2 refactor).

Add the lora_name InferenceEngineConfig entry to the EN/ZH CLI reference tables, generated from the new config field. The adapter name lets generation requests select the served LoRA adapter and is auto-filled from gconfig.lora_name by PPOConfig.__post_init__ so the load and request sides stay in sync.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

lifeiteng requested review from HwVanICI, fishcrap, garrett4wade, geshi001, guozhihao-224, nuzant, rchardx and sitabulaixizawaluduo as code owners June 3, 2026 06:18

gemini-code-assist Bot reviewed Jun 3, 2026

sitabulaixizawaluduo reviewed Jun 4, 2026

lifeiteng and others added 4 commits June 5, 2026 10:27

Update areal/infra/remote_inf_engine.py

a9678e3

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

lifeiteng force-pushed the fixlora branch from 32725b1 to a9678e3 Compare June 5, 2026 02:27

lifeiteng wants to merge 4 commits into areal-project:main from lifeiteng:fixlora

2 participants
