Batch size increases when using FSDP, rather than memory usage being reduced #16855
Unanswered
Luofan-KK asked this question in DDP / multi-GPU / multi-node
Replies: 2 comments
- Also noticing this. I would be curious why the batch size increases when I'd rather it stay at the prescribed value in the data loader.
- Digging into the docs - I think this can be resolved by setting (in Trainer)
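If the goal is to keep the effective (global) batch size at a fixed value regardless of how many devices are used, one common workaround, sketched below as an assumption rather than anything Lightning prescribes, is to derive the per-device batch size from the world size. The helper name and values here are illustrative, not from the discussion.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader

def make_loader(dataset, global_batch_size: int) -> DataLoader:
    # Hypothetical helper: keep the *global* batch size fixed by giving
    # each rank a proportionally smaller per-process batch.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    per_device_batch = max(1, global_batch_size // world_size)
    return DataLoader(dataset, batch_size=per_device_batch, shuffle=True)
```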
- I was told that Fully Sharded Training shards the entire model across all available GPUs, so I assumed that, for a fixed batch size, the memory usage on each GPU would decrease as more GPUs become available.
In my experience, however, when I use more GPUs the actual batch size increases instead:
actual_batch_size = num_GPUs * given_batch_size
rather than the model being sharded more finely. I want to tune a huge model, and this will cause OOM. I've searched similar questions and found an example, but how can I determine the batch size?
Following is my code:
Thanks
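As a rough illustration of the setup being described (not the poster's actual code), a minimal sketch using the FSDP strategy could look like the following, assuming a Lightning 2.x install; the toy model, dataset, and device count are assumptions. The key point is that the DataLoader's batch_size is applied per process, so each optimizer step consumes batch_size * num_devices samples in total, even though FSDP shards the model parameters across those same devices.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModule(pl.LightningModule):
    # Minimal stand-in for the "huge model" in the question.
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1024, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(512, 1024), torch.randint(0, 2, (512,)))
    # batch_size here is per GPU: with devices=4, each step processes
    # 4 * 8 = 32 samples globally.
    loader = DataLoader(dataset, batch_size=8)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="fsdp",  # shards parameters, gradients, and optimizer state
        max_epochs=1,
    )
    trainer.fit(ToyModule(), loader)
```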