Fix race condition in --async-offload that can cause corruption #10501
Core change notes (2nd commit):
This sync is necessary because PyTorch queues async CUDA frees on the same stream the tensor was created on. With async offload, that is the offload stream.

Weights and biases can go out of scope in Python, which triggers the PyTorch garbage collector to queue the free operation on the offload stream, possibly before the compute stream has used the weight. This causes a use-after-free on the weight data, leading to total corruption of some workflows.
So, sync the offload stream with the compute stream after the weight has been used, so the queued free must wait until the weight has actually been consumed.
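The ordering argument above can be modeled without a GPU. The sketch below is a hedged, pure-Python analogy (not the PR's actual code): each `Stream` is a FIFO worker standing in for a CUDA stream, and the extra `used.wait` on the offload stream plays the role of the added cross-stream sync that forces the free to run after the compute-side use.

```python
import threading
import queue

class Stream:
    """Toy model of a CUDA stream: ops execute in FIFO order on one worker."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn = self.q.get()
            fn()

    def launch(self, fn):
        self.q.put(fn)

    def synchronize(self):
        done = threading.Event()
        self.q.put(done.set)
        done.wait()

log = []
compute = Stream()
offload = Stream()

# The weight is "created" on the offload stream, so (per the allocator
# behaviour described above) its eventual free is queued there too.
offload.launch(lambda: log.append("copy weight to GPU"))

# Compute must wait for the copy before using the weight.
copied = threading.Event()
offload.launch(copied.set)
compute.launch(copied.wait)
compute.launch(lambda: log.append("use weight"))

# The fix: before the free runs on the offload stream, make it wait
# until the compute stream has consumed the weight.
used = threading.Event()
compute.launch(used.set)
offload.launch(used.wait)  # <- the added sync
offload.launch(lambda: log.append("free weight"))

offload.synchronize()
compute.synchronize()
assert log.index("use weight") < log.index("free weight")
```

Without the `used.wait` line, nothing orders "free weight" after "use weight": the free could run as soon as the copy finishes, which is exactly the race the PR describes.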
cast_bias_weight is extended in a backwards-compatible way, with the new behaviour opt-in via a defaulted parameter. This keeps custom node packs that call cast_bias_weight working, but disables async offload for them (as they do not handle the race).
The pattern is now:

```python
cast_bias_weight(..., offloadable=True)
thing(weight, bias, ...)
uncast_bias_weight(...)
```
Example test case:
Linux, RTX3060, 96GB RAM
Wan 2.2 I2V FP8, 192x192x9f, 5+5 steps
`python main.py --novram --async-offload`
This repeatedly produces black frames and corrupted outputs:
As a control, here is the same run without --async-offload:
And with this fix:
The function-signature change to cast_bias_weight may cause a performance regression for users of --async-offload with custom node packs that call this function as a helper. I don't see a way to both keep performance and solve the race without updating custom node packs one by one.
Amongst my ever-growing collection of custom nodes, there are some from @kijai's packs (FYI):