[BUG] Resume interrupted doesn't work for RF-DETR #336

@JeroenDelcour

🧠 Describe the Bug

When distilling to RF-DETR, stopping and then resuming the training results in the following error:

Restoring states from the checkpoint path at /home/jeroen/pretrain/out/rf-detr-base/checkpoints/last.ckpt
Traceback (most recent call last):
  File "/home/jeroen/pretrain/distill_rfdetr_base.py", line 4, in <module>
    lightly_train.train(
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/lightly_train/_commands/train.py", line 238, in train
    train_from_config(config=config)
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/lightly_train/_commands/train.py", line 419, in train_from_config
    trainer_instance.fit(
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 560, in fit
    call._call_and_handle_interrupt(
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 598, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 411, in _restore_modules_and_callbacks
    self.restore_callbacks()
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 328, in restore_callbacks
    call._call_callbacks_on_load_checkpoint(trainer, self._loaded_checkpoint)
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 291, in _call_callbacks_on_load_checkpoint
    callback.on_load_checkpoint(trainer, trainer.lightning_module, checkpoint)
  File "/home/jeroen/.pyenv/versions/rf-detr/lib/python3.11/site-packages/lightly_train/_callbacks/checkpoint.py", line 105, in on_load_checkpoint
    self._models.model.load_state_dict(_checkpoint.models.model.state_dict())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'RFDETRBase' object has no attribute 'load_state_dict'
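The failing assumption can be reproduced independently of lightly-train: the checkpoint callback calls load_state_dict directly on the wrapped model object, which only works when that object is itself a torch.nn.Module. The PlainWrapper class below is a hypothetical stand-in, not the real RFDETRBase:

```python
import torch
from torch import nn

class PlainWrapper:
    # Hypothetical stand-in for a wrapper like RFDETRBase that holds a
    # torch module as an attribute but is not itself an nn.Module.
    def __init__(self) -> None:
        self.model = nn.Linear(4, 2)

wrapper = PlainWrapper()
try:
    # Mirrors the failing call in lightly_train's checkpoint callback.
    wrapper.load_state_dict(wrapper.model.state_dict())
except AttributeError as err:
    print(err)  # 'PlainWrapper' object has no attribute 'load_state_dict'
```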

🔁 Steps to Reproduce

Just run the example code from the docs, interrupt the training, then rerun with the resume_interrupted argument set to True:

import lightly_train

lightly_train.train(
    out="out/rf-detr-base",
    data="./data",
    model="rfdetr/rf-detr-base",
    resume_interrupted=True,
)

🤖 Environment Details

  • OS: Ubuntu 24.04
  • Python version: 3.11
  • Frameworks/Libraries (with versions): lightly-train[rf-detr] 0.11.3
  • How did you install the package: pip

📌 Additional Context

The RFDETRBase object is not a torch.nn.Module, so calling load_state_dict on it fails. I'm not familiar with how lightly-train loads checkpoints, but maybe there's a way to override the default behavior and put the weights in the right place for RF-DETR.
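One possible direction, sketched here under the assumption that the wrapper exposes its actual torch module through a nested attribute (the FakeRFDETRWrapper class, the load_into_wrapped helper, and the `.model` attribute path are all hypothetical, not the real rfdetr or lightly-train API): descend through the wrapper until an nn.Module is found and call load_state_dict there.

```python
import torch
from torch import nn

class FakeRFDETRWrapper:
    # Hypothetical stand-in for RFDETRBase: a plain object that exposes
    # the real torch module via a `.model` attribute.
    def __init__(self) -> None:
        self.model = nn.Linear(4, 2)

def load_into_wrapped(wrapped: object, state_dict: dict) -> nn.Module:
    # Walk down `.model` attributes until an actual nn.Module appears,
    # then restore the checkpoint weights there instead of on the wrapper.
    target = wrapped
    while not isinstance(target, nn.Module):
        target = target.model
    target.load_state_dict(state_dict)
    return target

# Usage: copy weights from one wrapper to another via a state dict.
src, dst = FakeRFDETRWrapper(), FakeRFDETRWrapper()
load_into_wrapped(dst, src.model.state_dict())
```

The checkpoint callback could apply the same idea: special-case wrappers that are not nn.Modules rather than assuming load_state_dict exists on every model object.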
