Whole model gets offloaded to the CPU #1122

Open
SzymonOzog opened this issue Feb 4, 2025 · 2 comments

@SzymonOzog

I'm running the following code to calculate an offload device map:

    from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

    MODEL_ID = "deepseek-ai/DeepSeek-R1"
    device_map = calculate_offload_device_map(
        MODEL_ID, num_gpus=8, reserve_for_hessians=True, trust_remote_code=True
    )

    print("calculated device map", device_map)

After it finishes, it offloads the whole model to the CPU (and disk), which results in very slow compression.

calculated device map OrderedDict({'model.embed_tokens': 'cpu', 'model.layers.0': 'cpu', 'model.layers.1': 'cpu', 'model.layers.2': 'cpu', 'model.layers.3': 'cpu', 'model.layers.4': 'cpu', 'model.layers.5': 'cpu', 'model.layers.6': 'cpu', 'model.layers.7': 'cpu', 'model.layers.8': 'cpu', 'model.layers.9': 'cpu', 'model.layers.10': 'cpu', 'model.layers.11': 'cpu', 'model.layers.12': 'cpu', 'model.layers.13': 'cpu', 'model.layers.14': 'cpu', 'model.layers.15': 'cpu', 'model.layers.16': 'cpu', 'model.layers.17': 'cpu', 'model.layers.18': 'cpu', 'model.layers.19': 'cpu', 'model.layers.20': 'cpu', 'model.layers.21': 'cpu', 'model.layers.22': 'cpu', 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu', 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'disk', 'model.layers.29': 'disk', 'model.layers.30': 'disk', 'model.layers.31': 'disk', 'model.layers.32': 'disk', 'model.layers.33': 'disk', 'model.layers.34': 'disk', 'model.layers.35': 'disk', 'model.layers.36': 'disk', 'model.layers.37': 'disk', 'model.layers.38': 'disk', 'model.layers.39': 'disk', 'model.layers.40': 'disk', 'model.layers.41': 'disk', 'model.layers.42': 'disk', 'model.layers.43': 'disk', 'model.layers.44': 'disk', 'model.layers.45': 'disk', 'model.layers.46': 'disk', 'model.layers.47': 'disk', 'model.layers.48': 'disk', 'model.layers.49': 'disk', 'model.layers.50': 'disk', 'model.layers.51': 'disk', 'model.layers.52': 'disk', 'model.layers.53': 'disk', 'model.layers.54': 'disk', 'model.layers.55': 'disk', 'model.layers.56': 'disk', 'model.layers.57': 'disk', 'model.layers.58': 'disk', 'model.layers.59': 'disk', 'model.layers.60': 'disk', 'model.norm': 'disk', 'lm_head': 'disk'})

vllm 0.7.1
transformers 4.48.2
accelerate 1.0.1

Running on a node with 8xH100
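
For reference, here is a minimal sketch of the same planning step done directly with accelerate's infer_auto_device_map, which (to my understanding) calculate_offload_device_map builds on. The 0.9 utilization factor, the 2 GiB per-GPU reserve, and the 64 GiB CPU cap are illustrative assumptions, not values taken from llm-compressor:

    import torch
    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    MODEL_ID = "deepseek-ai/DeepSeek-R1"

    config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
    with init_empty_weights():
        # Instantiate on the meta device: no weights are allocated, so even
        # a model of this size "fits" for planning purposes.
        meta_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

    num_gpus = 8
    reserve_per_gpu = 2 * 1024**3  # assumed reserve for hessians/activations
    max_memory = {
        i: int(torch.cuda.get_device_properties(i).total_memory * 0.9) - reserve_per_gpu
        for i in range(num_gpus)
    }
    max_memory["cpu"] = 64 * 1024**3  # explicit cap on CPU spill-over

    device_map = infer_auto_device_map(
        meta_model,
        max_memory=max_memory,
        # _no_split_modules is a private transformers attribute listing the
        # layer classes that must not be sharded across devices.
        no_split_module_classes=meta_model._no_split_modules,
        dtype=torch.bfloat16,  # plan layer sizes as if weights were bf16
    )
    print(device_map)

If the per-GPU budgets come out at or near zero (for example, because the reserve is computed too large), infer_auto_device_map pushes every module to "cpu" and then "disk", which is exactly the shape of the map shown above.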

@endic-sam928281

Hello, we tried to solve the issue.

This is what we did:

We modified the calculate_offload_device_map function to make better use of available GPU memory. The changes include:

  1. Increased the memory reserved for quantization and hessians.
  2. Added a safety margin to prevent GPU out-of-memory errors.
  3. Adjusted the memory calculation to account for the model's total size.
  4. Implemented a more balanced distribution of layers across the available GPUs.

You can review the changes in this commit: endic-sam928281@757adce. A rough sketch of this budgeting idea follows below.
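
For illustration only (the actual change is in the commit linked above), the budgeting described in points 1, 2, and 4 could look roughly like the sketch below. gpu_memory_budget is a hypothetical helper name, and the fractions are assumptions, not the committed values:

    import torch

    def gpu_memory_budget(num_gpus: int,
                          safety_margin: float = 0.10,
                          hessian_reserve: float = 0.15) -> dict:
        # Per-GPU budget: total memory minus a safety margin (point 2) and a
        # reservation for quantization/hessian buffers (point 1). Giving every
        # GPU the same budget encourages an even spread of layers (point 4).
        budget = {}
        for i in range(num_gpus):
            total = torch.cuda.get_device_properties(i).total_memory
            budget[i] = int(total * (1.0 - safety_margin - hessian_reserve))
        return budget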

Caution

Disclaimer: This solution concept was generated by AI. Never copy-paste the code without checking its correctness; it may be incomplete and should be treated as inspiration only.


Latta AI seeks to solve problems in open source projects as part of its mission to support developers around the world. Learn more about our mission at https://latta.ai/ourmission. If you no longer want Latta AI to attempt to solve issues on your repository, you can block this account.

@kylesayrs self-assigned this Feb 4, 2025
@yunkchen

same problem~
