
Conversation

@rattus128 (Contributor) commented Oct 27, 2025

Core change notes (2nd commit):

This sync is necessary because PyTorch queues CUDA async frees on the
same stream that created the tensor. In the case of async offload, this
will be the offload stream.

Weights and biases can go out of scope in Python, which then triggers
the PyTorch garbage collector to queue the free operation on the
offload stream, possibly before the compute stream has used the weight.
This causes a use-after-free on the weight data, leading to total
corruption of some workflows.

So sync the offload stream with the compute stream after the weight has
been used, so that the free has to wait until the weight has actually
been consumed.
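
To make the race concrete, here is a minimal sketch of the hazard in plain PyTorch, assuming a CUDA device; the names are illustrative and this is not ComfyUI's actual code:

```python
import torch

x = torch.randn(64, 64, device="cuda")
weight_cpu = torch.randn(64, 64).pin_memory()

compute = torch.cuda.default_stream()
offload = torch.cuda.Stream()

with torch.cuda.stream(offload):
    # The copy runs on the offload stream, so PyTorch associates this
    # tensor (and its eventual free) with that stream.
    w = weight_cpu.to("cuda", non_blocking=True)

compute.wait_stream(offload)  # compute must not read w before the copy lands
y = x @ w                     # enqueued on the compute stream

# If w now goes out of scope, its free is queued on the offload stream
# and can execute before the matmul has consumed the data (the
# use-after-free described above). Making the offload stream wait on
# compute closes the window:
offload.wait_stream(compute)
del w
```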

cast_bias_weight is extended in a backwards-compatible way, with the
new behaviour opt-in via a defaulted parameter. This handles custom
node packs that call cast_bias_weight directly, disabling async offload
for them (as they do not handle the race).

The pattern is now:

cast_bias_weight(..., offloadable=True)
thing(weight, bias, ...)
uncast_bias_weight(...)
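
For illustration, a hedged sketch of how a module forward might adopt the new pattern; the import path and the exact uncast_bias_weight argument list are assumptions here, so check ComfyUI's ops module for the real API:

```python
import torch
import torch.nn.functional as F
# Assumed import path for the two helpers.
from comfy.ops import cast_bias_weight, uncast_bias_weight

class Linear(torch.nn.Linear):
    def forward_comfy_cast_weights(self, input):
        # Opt in to async offload: weight/bias may be cast on the
        # offload stream rather than the compute stream.
        weight, bias = cast_bias_weight(self, input, offloadable=True)
        out = F.linear(input, weight, bias)
        # Sync the offload stream with the compute stream so the
        # eventual frees of weight/bias wait for the linear above.
        # (The argument list here is a guess, not the confirmed one.)
        uncast_bias_weight(self, weight, bias)
        return out
```

Custom node packs that keep calling cast_bias_weight without the new parameter get the old behaviour, with async offload disabled for that cast.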


Example test case:

Linux, RTX3060, 96GB RAM
Wan 2.2 I2V FP8, 192x192x9f, 5+5 steps
python main.py --novram --async-offload

This repeatedly black-screens my display and corrupts my outputs:

[screenshot: fp8-async-nvr-fail3]

For control's sake, here is the same without --async-offload:

[screenshot: fp8-async-nvr-control-no-async]

And with this fix:

[screenshot: fp8-async-nvr-fixed]

The function signature change to cast_bias_weight may cause a performance regression for users of --async-offload with custom node packs that call this function as a helper. I don't see a way to both keep performance and solve the race without updating custom node packs one by one.

Amongst my ever-growing collection of custom nodes, there are some in @kijai's publications (FYI):

./ComfyUI-KJNodes/nodes/model_optimization_nodes.py:203:                        weight, bias = cast_bias_weight(self, input)
./ComfyUI-WanVideoWrapper/custom_linear.py:88:        weight, bias = cast_bias_weight(self, input)

@rattus128 force-pushed the async-offload-streams branch from e61b839 to a50c8c2 on October 27, 2025 09:58
@rattus128 (Contributor, Author)
This one would do well with an integration test. Is there a way to run an integration test with --async-offload set?

@Kosinkadink (Collaborator)
Nice! Talked with comfy and this will have to go in after #10498 is merged, so this PR will need to be rebased on that one.

@comfyanonymous (Owner)
Can you rebase this on the latest?

Make this a reusable function.
This is necessary for safe async weight offloading.

Currently this peeks ahead to sync the next stream in the queue of
streams with the compute stream. This doesn't allow much
parallelization, as the end result is that you can only get one weight
load ahead regardless of how many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of
the compute stream's position.
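
For illustration, a rough sketch of the rotated queue under the assumption of a simple round-robin pool; the names and structure are mine, not the actual ComfyUI implementation:

```python
import torch
from collections import deque

class OffloadStreamPool:
    # Illustrative round-robin pool of offload streams.
    def __init__(self, count, compute_stream):
        self.compute = compute_stream
        self.queue = deque(torch.cuda.Stream() for _ in range(count))

    def next_stream(self):
        # Old (peek-ahead) logic synced the stream that would be handed
        # out on the *next* call. By the time that stream was used, the
        # recorded wait already covered the compute work consuming the
        # most recent loads, so only one load could run ahead.
        s = self.queue.popleft()
        self.queue.append(s)
        # Rotated logic: sync the stream now at the end of the queue
        # (the one being returned) against compute's *current* position,
        # before the consumer of its new load is enqueued. Its previous
        # load, handed out len(queue) calls ago, has been consumed by
        # now, so the wait is cheap and several loads can run ahead of
        # the compute stream.
        s.wait_stream(self.compute)
        return s
```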
@rattus128 force-pushed the async-offload-streams branch from a50c8c2 to 6ceba8c on October 29, 2025 11:39
@rattus128 (Contributor, Author)
Can you rebase this on the latest?

Done. Thanks.

@Kosinkadink (Collaborator)
I think it's good to go - I'll double-check with comfy before merging.

@comfyanonymous merged commit ab7ab5b into comfyanonymous:master on Oct 29, 2025
10 checks passed