Reduce peak memory during FLUX model load #7564

Merged: RyanJDick merged 1 commit into main from ryan/lower-virtual-memory on Jan 16, 2025

Conversation

RyanJDick
Collaborator

Summary

Prior to this change, there were several cases where we initialized the weights of a FLUX model before loading its state dict (and, to make things worse, in some cases the weights were in float32). This PR fixes a handful of these cases. (I think I found all instances for the FLUX family of models.)
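
For reference, the general pattern looks roughly like this (a minimal sketch, not the code from this PR; `load_on_meta_then_assign` and `model_cls` are illustrative names, and it assumes a PyTorch version where `load_state_dict` supports `assign=True`, i.e. >= 2.1):

```python
import torch

def load_on_meta_then_assign(model_cls, state_dict):
    """Minimal sketch: build the module on the meta device (so no memory is
    allocated for randomly initialized weights), then attach the real tensors."""
    with torch.device("meta"):
        model = model_cls()  # e.g. a FLUX submodel class (hypothetical here)
    # assign=True swaps the meta parameters for the loaded tensors instead of
    # trying to copy into storage that was never allocated.
    model.load_state_dict(state_dict, assign=True)
    return model
```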

Related Issues / Discussions

QA Instructions

I tested that model loading still works and that there is no virtual memory reservation on model initialization for the following models:

  • FLUX VAE
  • Full T5 Encoder
  • Full FLUX checkpoint
  • GGUF FLUX checkpoint

Merge Plan

No special instructions.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

…models to ensure that the model is initialized on the meta device prior to loading the state dict into it. This helps to keep peak memory down.
github-actions bot added the python (PRs that change python files) and backend (PRs that change backend files) labels on Jan 16, 2025
RyanJDick merged commit 0abb5ea into main on Jan 16, 2025
15 checks passed
RyanJDick deleted the ryan/lower-virtual-memory branch on January 16, 2025 at 23:47
RyanJDick added a commit that referenced this pull request Jan 17, 2025
## Summary

This PR adds a `keep_ram_copy_of_weights` config option; the default (and legacy) behavior is `true`. The tradeoffs for this setting are as follows (see the sketch after this list):
- `keep_ram_copy_of_weights: true`: faster model switching and LoRA patching.
- `keep_ram_copy_of_weights: false`: lower average RAM load (may not help significantly with peak RAM).
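
To make the tradeoff concrete, here is an illustrative sketch (not the actual cache code in this PR; `LoadedModel` and its methods are hypothetical) of what keeping a RAM copy of the weights enables:

```python
import torch

class LoadedModel:
    """Hypothetical illustration of the `keep_ram_copy_of_weights` tradeoff."""

    def __init__(self, model: torch.nn.Module, keep_ram_copy_of_weights: bool):
        self.model = model
        # Optionally retain a CPU copy of the unpatched weights.
        self.cpu_state_dict = (
            {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            if keep_ram_copy_of_weights
            else None
        )

    def restore_original_weights(self) -> None:
        """Undo in-place patches (e.g. LoRA) applied to the live model."""
        if self.cpu_state_dict is not None:
            # Fast path: copy the retained weights back in from RAM.
            self.model.load_state_dict(self.cpu_state_dict)
        else:
            # Without a RAM copy, the original weights would have to be
            # re-read from disk (slower, but lower average RAM usage).
            raise NotImplementedError("reload weights from disk")
```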

## Related Issues / Discussions

- Helps with #7563
- The Low-VRAM docs are updated to include this feature in #7566

## QA Instructions

- Test with `enable_partial_load: false` and `keep_ram_copy_of_weights: false`.
  - [x] RAM usage when model is loaded is reduced.
  - [x] Model loading / unloading works as expected.
  - [x] LoRA patching still works.
- Test with `enable_partial_load: false` and `keep_ram_copy_of_weights: true`.
  - [x] Behavior should be unchanged.
- Test with `enable_partial_load: true` and `keep_ram_copy_of_weights: false`.
  - [x] RAM usage when model is loaded is reduced.
  - [x] Model loading / unloading works as expected.
  - [x] LoRA patching still works.
- Test with `enable_partial_load: true` and `keep_ram_copy_of_weights: true`.
  - [x] Behavior should be unchanged.

- [x] Smoke test CPU-only and MPS with default configs.

## Merge Plan

- [x] Merge #7564 first and change target branch.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [ ] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_