
πŸ› [Bug] Disable CPU offloading by default in MTTMΒ #3728

@peri044

Description

Bug Description

CPU offloading is enabled by default in MTTM, which causes device mismatch issues for embedding layers.

For example, here is the code for the VLM component of the GR00T model: https://github.com/NVIDIA/Isaac-GR00T/blob/main/gr00t/model/backbone/eagle2_hg_model/modeling_eagle2_5_vl.py#L235

Once the language model is compiled with MTTM, it is moved to the CPU. This operation then fails because the input_ids tensor is on the GPU while the embedding layer (self.embed_tokens) is on the CPU.
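
As a standalone illustration of the failure mode (plain PyTorch only; the vocabulary size, hidden size, and sequence length below are arbitrary), an embedding table whose weights remain on the CPU cannot consume GPU-resident input_ids:

import torch

# Mimics the offloaded state: the embedding weights live on the CPU.
embed_tokens = torch.nn.Embedding(1000, 64)
input_ids = torch.randint(0, 1000, (1, 16), device="cuda")

try:
    embed_tokens(input_ids)  # device mismatch: CPU weights, CUDA indices
except RuntimeError as err:
    print(err)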

The offload_module_to_cpu setting isn't supported in MTTM, so adding that support would fix this issue. The following works:

if self.additional_settings.get("offload_module_to_cpu", False):
    deallocate_module(self.original_model, delete_module=False)

However, deallocate_module is used in multiple places, each of which needs to be investigated.
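
For reference, a minimal sketch of how the setting could be exposed to users once it is supported. The offload_module_to_cpu keyword is the hypothetical flag discussed above (not an existing MTTM option), and the toy model simply stands in for the compiled language model:

import torch
import torch_tensorrt

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).cuda().eval()

# With offloading disabled (the proposed default), the original weights stay on
# the GPU, so uncompiled pieces of the pipeline still see CUDA parameters.
trt_model = torch_tensorrt.MutableTorchTensorRTModule(
    model,
    offload_module_to_cpu=False,  # hypothetical setting from this issue
)
out = trt_model(torch.randn(8, 64, device="cuda"))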

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0):
  • PyTorch Version (e.g. 1.0):
  • CPU Architecture:
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, libtorch, source):
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version:
  • CUDA version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Metadata

Labels

bug (Something isn't working)
