Add finetune memory instrumentation for issue #61 (#132)

Open

taivu1998 wants to merge 1 commit into ServiceNow:main from taivu1998:tdv/issue-61-memory-instrumentation
Conversation

@taivu1998

Summary

Related to #61.

This PR adds default-off memory instrumentation to the finetuning loop so we can localize the long-running trainer memory growth reported in issue #61.

This PR is intentionally not a full fix for the leak on its own. Its goal is to make the trainer produce enough evidence to distinguish between:

  • process-private memory growth,
  • file/cache-heavy checkpoint behavior,
  • retained batch / RL-step state,
  • and weight-update or other native/shared-memory effects.

What Changed

  • Added pipelinerl.finetune.memory_debug.MemoryDebugger to capture per-rank JSONL snapshots.
  • Logged process memory stats including RSS/VMS/USS/PSS/shared, thread count, FD count, and optional /proc/self/smaps_rollup data.
  • Logged optional cgroup memory stats so runs can distinguish anon/file/shmem/slab growth when available.
  • Logged CUDA allocator counters alongside host-memory stats.
  • Added batch metadata to snapshots so memory changes can be tied back to workload shape.
  • Wired the debugger into run_data_loader() and rl_finetuning_worker().
  • Added probes around:
    • startup,
    • loader handoff,
    • step boundaries,
    • optional micro-batch phases,
    • optimizer step,
    • weight updates,
    • checkpoint saves,
    • final saves,
    • shutdown.
  • Added a small finetune.memory_debug config block in conf/finetune/base.yaml.
  • Aligned force_restart cleanup with the actual finetune log directory (log/).
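To make the snapshot mechanism concrete, here is a minimal stdlib-only sketch of what a per-rank snapshot capture and JSONL append could look like. The field names and helper names below are illustrative assumptions, not the actual MemoryDebugger API; the real implementation in this PR also records USS/PSS/shared, CUDA allocator counters, smaps_rollup, and cgroup stats.

```python
import json
import os
import time


def _proc_status_kb(field: str) -> int:
    """Read a Vm* field (in kB) from /proc/self/status; 0 if unavailable."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
    except OSError:
        pass
    return 0


def snapshot(phase: str, step: int) -> dict:
    """Capture one host-memory snapshot (field names are hypothetical)."""
    fd_dir = "/proc/self/fd"
    return {
        "ts": time.time(),
        "phase": phase,
        "step": step,
        "rss_kb": _proc_status_kb("VmRSS"),
        "vms_kb": _proc_status_kb("VmSize"),
        "num_threads": _proc_status_kb("Threads"),
        "num_fds": len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else -1,
    }


def append_jsonl(path: str, record: dict) -> None:
    """Append one snapshot as a JSON line, one file per rank."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

The design point this illustrates: each probe is a single cheap read of procfs plus one buffered file append, which is why step-level probes can stay on by default once `enabled=true`.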

Config

The new config surface is intentionally small and default-off:

finetune:
  memory_debug:
    enabled: false
    log_every_n_steps: 100
    capture_smaps_rollup: false
    capture_cgroup_stats: true
    phase_granularity: step

When enabled, each rank writes:

  • finetune/log/memory_rank{rank}.jsonl

phase_granularity=micro_batch enables the finer loader / RL-step / backward probes.
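For orientation, a single line in `memory_rank0.jsonl` might look roughly like this (the field names and values are hypothetical, shown only to indicate the general shape of a record):

```json
{"ts": 1712098000.12, "phase": "step_end", "step": 4200, "rss_kb": 18874368, "uss_kb": 17301504, "num_fds": 212, "cuda_allocated_mb": 61440, "batch_tokens": 524288}
```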

Why This Shape

The issue screenshots suggest node-level host-memory growth, but they do not prove a single Python-RSS leak. This PR keeps the first instrumentation pass practical and low-risk:

  • default-off,
  • cheap step-level probes by default,
  • deeper probes only when explicitly requested,
  • no speculative GC or allocator tweaks in the normal training path.

Validation

I ran:

  • python3 -m compileall pipelinerl/finetune/memory_debug.py pipelinerl/finetune_loop.py
  • git diff --check
  • a focused smoke test covering:
    • config validation,
    • JSONL snapshot writing,
    • batch metadata capture,
    • safe no-op behavior after debugger shutdown.

Follow-Up

The next step for #61 is to run trainer-only replay with finetune.memory_debug.enabled=true on a captured training_data stream and use the emitted snapshots to decide whether the slope tracks:

  • checkpointing,
  • the core train loop,
  • or live weight updates / native memory.
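As a rough illustration of that decision procedure, one could fit a per-phase RSS slope over the captured snapshots. This is a sketch under the assumption that each JSONL record carries `phase`, `step`, and `rss_kb` fields (hypothetical names, not guaranteed by the actual output format):

```python
import json
from collections import defaultdict


def rss_slope_per_phase(jsonl_path: str) -> dict:
    """Least-squares slope of RSS (kB) versus step, computed per probe phase.

    Growth concentrated in one phase (e.g. checkpoint saves) points at
    that phase as the likely source of the memory-growth slope.
    """
    series = defaultdict(list)  # phase -> [(step, rss_kb), ...]
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            series[rec["phase"]].append((rec["step"], rec["rss_kb"]))

    slopes = {}
    for phase, pts in series.items():
        if len(pts) < 2:
            continue
        n = len(pts)
        sx = sum(s for s, _ in pts)
        sy = sum(r for _, r in pts)
        sxx = sum(s * s for s, _ in pts)
        sxy = sum(s * r for s, r in pts)
        denom = n * sxx - sx * sx
        if denom:  # skip degenerate series with a single distinct step
            slopes[phase] = (n * sxy - sx * sy) / denom
    return slopes
```

Comparing these slopes across phases (and against cgroup anon/file counters) is one way to tell a checkpoint-driven page-cache pattern apart from steady process-private growth in the train loop.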

@taivu1998 taivu1998 marked this pull request as ready for review April 2, 2026 22:46
@taivu1998 (Author)

Hi @rizar, could you help review this PR? Thank you so much!
