offload_to_disk=True is very slow for second initial forward and uses more VRAM #2174
avtc started this conversation in Show and tell
@Qubitium
Hi, I noticed that when `offload_to_disk=False`, the second initial forward is very fast (around 6 seconds for 1534 samples on GLM-4.5-Air). This mode also uses less VRAM for the same dataset: I was able to pass layers 0 and 1 and keep going, with `vram_strategy="balanced"` on 8 x 3090.
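For context, this is roughly how I am driving the run. Sketch only: I am not sure the exact placement of `offload_to_disk` and `vram_strategy` matches every GPTQModel version, and the bits, group size, paths and calibration loading are placeholders from my setup:

```python
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = [...]  # placeholder: my 1534 calibration samples

# Sketch only: argument placement may differ between GPTQModel versions.
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    offload_to_disk=False,     # True is the slow / high-VRAM case described below
    vram_strategy="balanced",  # spread the work across the 8 x 3090 cards
)

model = GPTQModel.load("zai-org/GLM-4.5-Air", quant_config)
model.quantize(calibration_dataset)
model.save("GLM-4.5-Air-GPTQ-4bit")
```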
When `offload_to_disk=True`, by contrast, it takes 6+ minutes and uses more VRAM, so I was not able to pass layer 0 even with `"balanced"`.

Another observation: MiniMax-M2 on my setup passes the first layer with 16 samples, but with 64+ samples there is a CUDA OOM on line 347 in modeling_minimax_m2.py:
```python
attn_weights = torch.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
```

I will think about moving inputs and outputs back and forth between VRAM and RAM (or to another GPU) during the forward pass, to be able to use more samples, and/or about optimizing modeling_minimax_m2.py to release unneeded tensors.
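A minimal sketch of the VRAM&lt;-&gt;RAM shuttling idea, as generic PyTorch rather than the library's actual capture loop, and simplified to a layer that only takes hidden states (real decoder layers also need masks/position ids): keep the cached layer inputs in system RAM and move one sample at a time onto the GPU, sending its output straight back.

```python
import torch

@torch.no_grad()
def forward_layer_with_cpu_cache(layer, cached_inputs, device="cuda:0"):
    """Run one decoder layer over all cached calibration samples, one sample at a time.

    The cached inputs stay in system RAM; each sample is moved to the GPU only for
    its own forward pass, and the output is moved straight back to RAM, so the GPU
    only ever holds a single sample's activations.
    """
    outputs = []
    for hidden in cached_inputs:
        hidden_gpu = hidden.to(device, non_blocking=True)
        out = layer(hidden_gpu)
        if isinstance(out, tuple):  # decoder layers typically return a tuple
            out = out[0]
        outputs.append(out.cpu())
        del hidden_gpu, out
    torch.cuda.empty_cache()  # release the freed blocks before the next layer
    return outputs
```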
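And for the "release unneeded tensors" angle, a sketch of one way to avoid the full fp32 materialization at the line above: run the fp32 softmax chunk-by-chunk over the head dimension and write the result back in place. Names follow the snippet above, the chunk size is arbitrary, and this is untested against modeling_minimax_m2.py:

```python
import torch

def chunked_softmax_(attn_weights: torch.Tensor, out_dtype: torch.dtype, chunk: int = 4) -> torch.Tensor:
    """Softmax over the last dim, computed in fp32 one chunk of heads at a time.

    The original line allocates a full fp32 copy of attn_weights
    (shape [bsz, num_heads, q_len, kv_len]) plus its cast-back copy; here the
    extra allocation is only one chunk of heads at a time.
    """
    for start in range(0, attn_weights.shape[1], chunk):
        block = attn_weights[:, start:start + chunk]
        attn_weights[:, start:start + chunk] = torch.softmax(
            block, dim=-1, dtype=torch.float32
        ).to(out_dtype)
    return attn_weights

# Possible drop-in for the OOM line (sketch):
# attn_weights = chunked_softmax_(attn_weights, query_states.dtype)
```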