Add finetune memory instrumentation for issue #61 (#132)

Open

taivu1998 wants to merge 1 commit into ServiceNow:main from taivu1998:tdv/issue-61-memory-instrumentation
Conversation

@taivu1998

Summary

Related to #61.

This PR adds default-off memory instrumentation to the finetuning loop so we can localize the long-running trainer memory growth reported in issue #61.

This PR is intentionally not a full fix for the leak on its own. Its goal is to make the trainer produce enough evidence to distinguish between:

  • process-private memory growth,
  • file/cache-heavy checkpoint behavior,
  • retained batch / RL-step state,
  • and weight-update or other native/shared-memory effects.

What Changed

  • Added pipelinerl.finetune.memory_debug.MemoryDebugger to capture per-rank JSONL snapshots.
  • Logged process memory stats including RSS/VMS/USS/PSS/shared, thread count, FD count, and optional /proc/self/smaps_rollup data.
  • Logged optional cgroup memory stats so runs can distinguish anon/file/shmem/slab growth when available.
  • Logged CUDA allocator counters alongside host-memory stats.
  • Added batch metadata to snapshots so memory changes can be tied back to workload shape.
  • Wired the debugger into run_data_loader() and rl_finetuning_worker().
  • Added probes around:
    • startup,
    • loader handoff,
    • step boundaries,
    • optional micro-batch phases,
    • optimizer step,
    • weight updates,
    • checkpoint saves,
    • final saves,
    • shutdown.
  • Added a small finetune.memory_debug config block in conf/finetune/base.yaml.
  • Aligned force_restart cleanup with the actual finetune log directory (log/).
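To make the snapshot mechanism concrete, here is a minimal stdlib-only sketch of what a per-rank snapshot capture and JSONL append could look like. The field names and helper names below are illustrative assumptions, not the actual MemoryDebugger API; the real implementation in this PR also records USS/PSS/shared, CUDA allocator counters, smaps_rollup, and cgroup stats.

```python
import json
import os
import time


def _proc_status_kb(field: str) -> int:
    """Read a Vm* field (in kB) from /proc/self/status; 0 if unavailable."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
    except OSError:
        pass
    return 0


def snapshot(phase: str, step: int) -> dict:
    """Capture one host-memory snapshot (field names are hypothetical)."""
    fd_dir = "/proc/self/fd"
    return {
        "ts": time.time(),
        "phase": phase,
        "step": step,
        "rss_kb": _proc_status_kb("VmRSS"),
        "vms_kb": _proc_status_kb("VmSize"),
        "num_threads": _proc_status_kb("Threads"),
        "num_fds": len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else -1,
    }


def append_jsonl(path: str, record: dict) -> None:
    """Append one snapshot as a JSON line, one file per rank."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

The design point this illustrates: each probe is a single cheap read of procfs plus one buffered file append, which is why step-level probes can stay on by default once `enabled=true`.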

Config

The new config surface is intentionally small and default-off:

finetune:
  memory_debug:
    enabled: false
    log_every_n_steps: 100
    capture_smaps_rollup: false
    capture_cgroup_stats: true
    phase_granularity: step

When enabled, each rank writes:

  • finetune/log/memory_rank{rank}.jsonl

phase_granularity=micro_batch enables the finer loader / RL-step / backward probes.
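For orientation, a single line in `memory_rank0.jsonl` might look roughly like this (the field names and values are hypothetical, shown only to indicate the general shape of a record):

```json
{"ts": 1712098000.12, "phase": "step_end", "step": 4200, "rss_kb": 18874368, "uss_kb": 17301504, "num_fds": 212, "cuda_allocated_mb": 61440, "batch_tokens": 524288}
```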

Why This Shape

The issue screenshots suggest node-level host-memory growth, but they do not prove a single Python-RSS leak. This PR keeps the first instrumentation pass practical and low-risk:

  • default-off,
  • cheap step-level probes by default,
  • deeper probes only when explicitly requested,
  • no speculative GC or allocator tweaks in the normal training path.

Validation

I ran:

  • python3 -m compileall pipelinerl/finetune/memory_debug.py pipelinerl/finetune_loop.py
  • git diff --check
  • a focused smoke test covering:
    • config validation,
    • JSONL snapshot writing,
    • batch metadata capture,
    • safe no-op behavior after debugger shutdown.

Follow-Up

The next step for #61 is to run trainer-only replay with finetune.memory_debug.enabled=true on a captured training_data stream and use the emitted snapshots to decide whether the slope tracks:

  • checkpointing,
  • the core train loop,
  • or live weight updates / native memory.
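As a rough illustration of that decision procedure, one could fit a per-phase RSS slope over the captured snapshots. This is a sketch under the assumption that each JSONL record carries `phase`, `step`, and `rss_kb` fields (hypothetical names, not guaranteed by the actual output format):

```python
import json
from collections import defaultdict


def rss_slope_per_phase(jsonl_path: str) -> dict:
    """Least-squares slope of RSS (kB) versus step, computed per probe phase.

    Growth concentrated in one phase (e.g. checkpoint saves) points at
    that phase as the likely source of the memory-growth slope.
    """
    series = defaultdict(list)  # phase -> [(step, rss_kb), ...]
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            series[rec["phase"]].append((rec["step"], rec["rss_kb"]))

    slopes = {}
    for phase, pts in series.items():
        if len(pts) < 2:
            continue
        n = len(pts)
        sx = sum(s for s, _ in pts)
        sy = sum(r for _, r in pts)
        sxx = sum(s * s for s, _ in pts)
        sxy = sum(s * r for s, r in pts)
        denom = n * sxx - sx * sx
        if denom:  # skip degenerate series with a single distinct step
            slopes[phase] = (n * sxy - sx * sy) / denom
    return slopes
```

Comparing these slopes across phases (and against cgroup anon/file counters) is one way to tell a checkpoint-driven page-cache pattern apart from steady process-private growth in the train loop.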

@taivu1998 taivu1998 marked this pull request as ready for review April 2, 2026 22:46
@taivu1998 (Author)

Hi @rizar, could you help review this PR? Thank you so much!
