
Add move_to_device kwarg to the optimizer's load_state_dict #1344

Merged
merged 1 commit into bitsandbytes-foundation:main on Sep 19, 2024

Conversation

koute
Contributor

@koute koute commented Aug 31, 2024

This PR makes it possible to load an optimizer checkpoint without automatically moving the optimizer's state to the GPU.

Some background as to why: I keep the optimizer's state on the CPU to save VRAM and manually move it to the GPU as needed. Unfortunately, load_state_dict moves all of the optimizer's tensors to whatever device the model's parameters are currently on, which results in an OOM crash. So currently, before loading an optimizer checkpoint, I have to unnecessarily move my model to the CPU, call the optimizer's load_state_dict, and then move the model back to the GPU. With this PR I can skip this silly dance.
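
For reference, a minimal usage sketch of the new kwarg. The optimizer class, tensor shapes, and checkpoint file name below are illustrative assumptions; move_to_device is the argument this PR adds:

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

# Checkpoint saved earlier with the optimizer state held on the CPU.
state = torch.load("optimizer.pt", map_location="cpu")

# Default behavior (move_to_device=True) moves every state tensor to
# param.device, i.e. the GPU, which can OOM. Passing False keeps the
# loaded state where it is, so it can be moved manually as needed:
optimizer.load_state_dict(state, move_to_device=False)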


@matthewdouglas matthewdouglas self-assigned this Sep 10, 2024

@hansonw hansonw left a comment


I encountered this issue as well, but when using the paged variants of the optimizers (load_state_dict should re-create paged tensors instead of just calling .to(param.device)).

My solution (see the suggestion below) was to alter the initialization to use self.get_state_buffer instead. It's still somewhat orthogonal to this PR (though the intent is similar); I can submit a separate PR, but I'm curious what the maintainers think.

Comment on lines +200 to +201
if move_to_device:
    value[k] = v.to(param.device)

Suggested change
-if move_to_device:
-    value[k] = v.to(param.device)
+buffer = self.get_state_buffer(v, v.dtype)
+buffer.copy_(v)
+value[k] = buffer
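
For context, a hedged sketch of the scenario this suggestion addresses. The checkpoint file name is an illustrative assumption; PagedAdamW8bit is one of the paged optimizer variants, and get_state_buffer is the helper named above, which the comment indicates allocates paged storage when the optimizer is paged:

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
# Paged optimizers back their state with pageable (unified) memory.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-3)

state = torch.load("optimizer.pt", map_location="cpu")
# With a plain v.to(param.device), the restored state becomes an ordinary
# CUDA tensor and the paging property is lost. The suggestion above instead
# allocates a fresh buffer via self.get_state_buffer(v, v.dtype) and copies
# the checkpointed values into it, preserving paged storage.
optimizer.load_state_dict(state)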

Member

Thanks @hansonw! This seems reasonable as a separate PR!

This makes it possible to load an optimizer checkpoint without
automatically moving the optimizer's state to the GPU.
@matthewdouglas matthewdouglas merged commit 8fc7892 into bitsandbytes-foundation:main Sep 19, 2024
28 checks passed
matthewdouglas pushed a commit to matthewdouglas/bitsandbytes that referenced this pull request Oct 28, 2024
Add move_to_device kwarg to the optimizer's load_state_dict (bitsandbytes-foundation#1344)