Current best practices to initialize massive (50B+ parameter) models #16944
-
Hi, I am working with GPT-style models and need to initialize a model at GPT-3 scale. Unfortunately, this means the model will run out of memory during initialization on CPU (or take an eternity to initialize layer by layer on CPU before shipping to GPU). In vanilla PyTorch I solved this with FSDP by constructing the model on the "meta" device and doing the full initialization on GPU afterward. What is the current best, most performant method to accomplish this with Lightning? Note: I found a reference to an init_meta_context(), but my pytorch-lightning (v1.9.0) has no such functionality.
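For concreteness, here is roughly the vanilla-PyTorch pattern I mean (a minimal sketch: `GPT` and `config` are placeholder names, and it assumes a recent PyTorch, ≥2.1, for `torch.device` as a context manager and `to_empty(recurse=...)`):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Construct on the meta device: parameters carry only shape/dtype metadata,
# so nothing is allocated and construction is near-instant.
with torch.device("meta"):
    model = GPT(config)  # placeholder model class and config

def init_fn(module: torch.nn.Module) -> None:
    # Allocate real (uninitialized) storage for this submodule on the local
    # GPU, then run the module's own weight init. FSDP calls this once per
    # submodule that still holds meta parameters.
    module.to_empty(device=torch.cuda.current_device(), recurse=False)
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()

fsdp_model = FSDP(model, param_init_fn=init_fn)
```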
-
Hi! The `init_meta_context` functionality was replaced with a `torchdistx` integration in #13868. You can do the following:

```python
from torchdistx.deferred_init import deferred_init

model = deferred_init(YourLightningModule)
```

And we'll materialize it for you in the Trainer. This is very experimental, and you might encounter installation issues. In the long term, we'll adopt the fake tensor mode from PyTorch: #16448. Otherwise, for a stable(r) solution, you can use the DeepSpeed integration: https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#deepspeed-zero-stage-3
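A slightly fuller sketch of how this fits together (`BigModel` and its constructor arguments are placeholders; `deferred_init` forwards `*args`/`**kwargs` to the module's `__init__`):

```python
import pytorch_lightning as pl
from torchdistx.deferred_init import deferred_init

# Record the construction instead of executing it: no parameter memory is
# allocated here. Constructor args are forwarded to BigModel.__init__.
model = deferred_init(BigModel, hidden_size=12288, num_layers=96)

# The Trainer materializes the recorded module when the strategy sets it
# up, so parameters are created directly on the target devices.
trainer = pl.Trainer(accelerator="gpu", devices=8)
trainer.fit(model)
```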
-
Hey, in addition to @carmocca's response: for DeepSpeed you need to initialize the model within the `configure_sharded_model` hook.
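A minimal sketch of that hook (layer sizes are placeholders): with `strategy="deepspeed_stage_3"`, the Trainer enters DeepSpeed's sharded-init context around this hook, so layers are partitioned across GPUs as they are constructed.

```python
import pytorch_lightning as pl
import torch.nn as nn

class BigModel(pl.LightningModule):
    def configure_sharded_model(self):
        # Runs inside the strategy's sharding context (deepspeed.zero.Init
        # for ZeRO-3): each layer is partitioned as it is created, so the
        # full model never exists on a single device.
        self.net = nn.Sequential(
            nn.Linear(4096, 4096),
            nn.GELU(),
            nn.Linear(4096, 4096),
        )

    def forward(self, x):
        return self.net(x)

trainer = pl.Trainer(
    accelerator="gpu", devices=8, strategy="deepspeed_stage_3", precision=16
)
```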
-
Thank you, I got this working with the DeepSpeed integration (it was failing before because of my own errors). Just a small follow-up: does reloading the model from a checkpoint still function the same and utilize the `configure_sharded_model` hook?
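For context, what I mean by reloading (paths are placeholders; note that a DeepSpeed ZeRO-3 checkpoint in Lightning is saved as a directory of shards rather than a single file):

```python
# Resuming from a DeepSpeed checkpoint: pass the checkpoint directory's
# path as ckpt_path so model and optimizer state are restored through the
# same sharded setup the strategy used during training.
model = BigModel()
trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="deepspeed_stage_3")
trainer.fit(model, ckpt_path="lightning_logs/version_0/checkpoints/epoch=1-step=500.ckpt")
```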