DOT training is very slow on a Google Colab A100 instance #2

@masato-ka

We migrated to the latest LeRobot and ran DOT training with my dataset on a Google Colab A100 instance.
The code and dataset are here:
https://github.com/masato-ka/lerobot/tree/policy/dot_policy
https://huggingface.co/datasets/masato-ka/so100_lego_sort

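However, training was slower than ACT in the same environment. The logs show that both data loading and the gradient update are slower than expected: on the A100, DOT reports updt_s of about 0.10 and data_s of about 0.24, while ACT reports updt_s of about 0.06 and data_s near 0.000.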

ACT on Colab A100

INFO 2025-05-02 05:02:17 ts/train.py:232 step:200 smpl:2K ep:2 epch:0.03 loss:6.786 grdn:154.530 lr:1.0e-05 updt_s:0.066 data_s:0.004
INFO 2025-05-02 05:02:29 ts/train.py:232 step:400 smpl:3K ep:4 epch:0.07 loss:3.049 grdn:85.140 lr:1.0e-05 updt_s:0.055 data_s:0.000
INFO 2025-05-02 05:02:40 ts/train.py:232 step:600 smpl:5K ep:5 epch:0.10 loss:2.572 grdn:75.739 lr:1.0e-05 updt_s:0.056 data_s:0.000

DOT on Colab A100

INFO 2025-05-02 05:04:38 ts/train.py:232 step:200 smpl:2K ep:2 epch:0.03 loss:0.205 grdn:2.208 lr:1.0e-04 updt_s:0.111 data_s:0.241
INFO 2025-05-02 05:05:45 ts/train.py:232 step:400 smpl:3K ep:4 epch:0.07 loss:0.126 grdn:1.835 lr:1.0e-04 updt_s:0.101 data_s:0.235
INFO 2025-05-02 05:06:52 ts/train.py:232 step:600 smpl:5K ep:5 epch:0.10 loss:0.117 grdn:1.733 lr:1.0e-04 updt_s:0.097 data_s:0.238

ACT on M2 MacBook Air

INFO 2025-05-02 13:42:05 ts/train.py:232 step:200 smpl:2K ep:2 epch:0.03 loss:6.827 grdn:155.095 lr:1.0e-05 updt_s:0.759 data_s:0.037
INFO 2025-05-02 13:45:06 ts/train.py:232 step:400 smpl:3K ep:4 epch:0.07 loss:3.058 grdn:85.390 lr:1.0e-05 updt_s:0.901 data_s:0.001

DOT on M2 MacBook Air

INFO 2025-05-02 13:49:46 ts/train.py:232 step:200 smpl:2K ep:2 epch:0.03 loss:0.234 grdn:19.501 lr:1.0e-04 updt_s:0.246 data_s:0.040
INFO 2025-05-02 13:50:37 ts/train.py:232 step:400 smpl:3K ep:4 epch:0.07 loss:0.134 grdn:19.544 lr:1.0e-04 updt_s:0.249 data_s:0.001
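As a sanity check on the numbers above, the logged timings can be compared with the wall-clock time between two log lines. The sketch below parses two of the DOT A100 lines and assumes that updt_s and data_s are per-step averages over the logging window; the helper names `parse` and `wall_seconds_per_step` are made up for illustration and are not part of LeRobot.

```python
import re
from datetime import datetime

# Matches the timestamp, the step counter, and the two timing fields of a log line.
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"step:(?P<step>\d+).*?"
    r"updt_s:(?P<updt>[\d.]+)\s+data_s:(?P<data>[\d.]+)"
)


def parse(line):
    """Return the timestamp, step, updt_s and data_s of one log line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"),
        "step": int(m["step"]),
        "updt_s": float(m["updt"]),
        "data_s": float(m["data"]),
    }


def wall_seconds_per_step(prev, cur):
    """Wall-clock seconds per training step between two parsed log lines."""
    elapsed = (cur["time"] - prev["time"]).total_seconds()
    return elapsed / (cur["step"] - prev["step"])


# Two DOT lines from the Colab A100 run, copied from the logs above.
dot_a100 = [
    "INFO 2025-05-02 05:04:38 ts/train.py:232 step:200 smpl:2K ep:2 epch:0.03 "
    "loss:0.205 grdn:2.208 lr:1.0e-04 updt_s:0.111 data_s:0.241",
    "INFO 2025-05-02 05:05:45 ts/train.py:232 step:400 smpl:3K ep:4 epch:0.07 "
    "loss:0.126 grdn:1.835 lr:1.0e-04 updt_s:0.101 data_s:0.235",
]

rows = [parse(line) for line in dot_a100]
wall = wall_seconds_per_step(rows[0], rows[1])
logged = rows[1]["updt_s"] + rows[1]["data_s"]
print(f"wall-clock per step: {wall:.3f} s, logged updt_s + data_s: {logged:.3f} s")
```

For these two lines the wall-clock time is 67 s over 200 steps, about 0.335 s per step, against 0.336 s for updt_s + data_s. So the logged timings seem to account for the whole step, and roughly 70% of each DOT step on the A100 is data loading.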

Unless I am mistaken, DOT's gradient calculation should be faster than ACT's, and data loading should take about the same time. The M2 logs match that expectation; the A100 logs do not.
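To separate the two costs outside the training script, the time spent waiting for a batch and the time spent in the update can be measured independently. This is a minimal sketch, assuming data_s corresponds to the wait for the next batch and updt_s to the update call; `timed_loop`, `slow_batches` and `fake_update` are hypothetical stand-ins, not LeRobot functions.

```python
import time


def timed_loop(batches, update_fn, num_steps):
    """Return (mean seconds waiting for a batch, mean seconds in update_fn)."""
    it = iter(batches)
    data_s = 0.0
    updt_s = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(it)      # time spent here counts as data loading
        t1 = time.perf_counter()
        update_fn(batch)      # time spent here counts as the update step
        t2 = time.perf_counter()
        data_s += t1 - t0
        updt_s += t2 - t1
    return data_s / num_steps, updt_s / num_steps


def slow_batches(delay_s):
    """Hypothetical loader that takes delay_s seconds to produce each batch."""
    while True:
        time.sleep(delay_s)
        yield [0.0] * 8


def fake_update(batch):
    """Hypothetical update step; a real one would run forward and backward."""
    sum(x * x for x in batch)


mean_data_s, mean_updt_s = timed_loop(slow_batches(0.01), fake_update, num_steps=5)
print(f"data_s:{mean_data_s:.3f} updt_s:{mean_updt_s:.3f}")
```

Running the same loop over the real data loader with the DOT and ACT configurations would show whether the extra data_s comes from the dataset settings rather than from the policy. With a real model on a GPU, the update timing is only meaningful if the device is synchronized before the clock is read.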

Is this an implementation problem?
