reproducing self forcing DMD results in 8x H200

# Reproducing self forcing DMD results in 8x H200
Great work! I tried to reproduce the results on 8x H200, starting from the Self Forcing ODE checkpoint.

## Config: 
ODE checkpoint: the ODE checkpoint provided by Self Forcing.

## Issue:
I was unable to stabilize the critic loss, and the DMD loss was very noisy and often ended up in a strange plateau.

## Measurement:
Visual comparison on prompts from the dataset and on prompts outside the dataset.

### prompt: 
"A Porsche, sleek and black, races swiftly along the asphalt. It weaves through the landscape against a backdrop of destroyed houses and skyscrapers cloaked in moss. As dawn breaks, the crimson sun ascends into the sky." 

### Self Forcing checkpoint

https://github.com/user-attachments/assets/0134cfe8-b259-4ac5-b537-4eae7324c676

### my checkpoint

https://github.com/user-attachments/assets/cc31262c-1922-4d68-8bc6-c93224b45d97


## training settings: 

Adjusted for H200 (which can support a per-GPU batch size of 3):
```
lr: 2.0e-06
lr_critic: 1.0e-06 
# Tried 4.0e-07, 5.0e-07, 6.0e-07, and 2.0e-06: none shows a downward trend (everything else fixed)
beta1: 0.0
beta2: 0.999
beta1_critic: 0.0
beta2_critic: 0.999
batch_size: 2
gradient_accumulation_steps: 4  # Effective batch size = 2 × 8 GPUs × 4 = 64
``` 
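The effective batch size in the config comment can be checked with a small sketch. The helper names `effective_batch_size` and `accumulation_options` are hypothetical and not part of the Self Forcing codebase; the only assumption is that every GPU processes `batch_size` samples per micro-step.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps):
    """Number of samples that contribute to one optimizer step."""
    return per_gpu_batch * num_gpus * accumulation_steps


def accumulation_options(target, num_gpus, max_per_gpu_batch):
    """List (per_gpu_batch, accumulation_steps) pairs that hit `target` exactly."""
    options = []
    for per_gpu_batch in range(1, max_per_gpu_batch + 1):
        samples_per_micro_step = per_gpu_batch * num_gpus
        if target % samples_per_micro_step == 0:
            options.append((per_gpu_batch, target // samples_per_micro_step))
    return options


print(effective_batch_size(2, 8, 4))   # 64
print(accumulation_options(64, 8, 3))  # [(1, 8), (2, 4)]
```

Note that a per-GPU batch size of 3 cannot reach an effective batch size of 64 exactly on 8 GPUs, since 64 is not divisible by 24.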

### Generator loss
My DMD loss curve quickly gets stuck in a plateau. The thick line is an EMA (0.9).
<img width="383" height="306" alt="Image" src="https://github.com/user-attachments/assets/62aa0ffd-7207-413d-a39a-3a14bd2689e0" />
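For reference, the smoothing used for the thick line can be sketched as below. This assumes the common convention where 0.9 is the weight on the previous smoothed value and no bias correction is applied; the plotting tool may use a slightly different variant, and `ema_smooth` is a hypothetical helper name.

```python
def ema_smooth(values, decay=0.9):
    """Exponential moving average: s_t = decay * s_{t-1} + (1 - decay) * x_t."""
    smoothed = []
    running = None
    for value in values:
        if running is None:
            # Initialize with the first observation (no bias correction).
            running = value
        else:
            running = decay * running + (1.0 - decay) * value
        smoothed.append(running)
    return smoothed


# A noisy curve oscillating around 1.0 is pulled toward its mean.
raw = [1.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(20)]
print(ema_smooth(raw)[-1])
```

With a decay of 0.9, each point averages over roughly the last 10 steps, so a plateau in the thick line reflects the trend rather than a single noisy step.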

### Critic loss
The critic seems to be struggling to keep up with the generator. I have tried increasing the critic learning rate from 4.0e-07 (the paper setting for DMD) all the way to 1.0e-06.
<img width="387" height="307" alt="Image" src="https://github.com/user-attachments/assets/92335539-38d6-439c-894d-391944331679" />

### Generator grad norm
The generator gradient norm stays at a low amplitude.
<img width="384" height="305" alt="Image" src="https://github.com/user-attachments/assets/b5bee2db-334d-4c61-9b59-549cf1aa2946" />

### Current hypothesis

My current guess is that the difference between my H200 run and the H100 setup, namely the 4 serial gradient accumulation steps, causes compounding floating-point errors, which makes reproducing the DMD checkpoint much more difficult in the 8-GPU setting. A low critic LR also seems to make the critic loss climb quickly, forcing the model into the nearest fake vacuum at earlier time steps.
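A minimal sketch for sanity-checking the size of the floating-point effect, under strong simplifying assumptions: it emulates float32 rounding with `struct` in plain Python, uses random numbers in place of real gradients, and ignores bf16 autocast, all-reduce ordering, and optimizer state. All function names are hypothetical. It compares a mean reduced in one pass against the same mean built from 4 micro-batches.

```python
import math
import random
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_one_pass(values):
    """Float32 mean over the whole batch in one serial reduction."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return f32(total / len(values))


def mean_accumulated(values, steps):
    """Float32 mean built from `steps` equally sized micro-batches."""
    chunk = len(values) // steps
    total = 0.0
    for i in range(steps):
        micro_mean = mean_one_pass(values[i * chunk:(i + 1) * chunk])
        # Each micro-batch contributes its mean scaled by 1 / steps.
        total = f32(total + f32(micro_mean / steps))
    return total


random.seed(0)
grads = [f32(random.gauss(0.0, 1.0)) for _ in range(64)]
reference = math.fsum(grads) / len(grads)  # float64 reference value

print("one pass error       :", abs(mean_one_pass(grads) - reference))
print("4 micro-batches error:", abs(mean_accumulated(grads, 4) - reference))
```

In this toy setting both errors stay at float32 rounding level for a single step. Whether such differences compound enough over training to explain the plateau is exactly the open question, so this only bounds the per-step effect and does not confirm or rule out the hypothesis.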

I am not sure about this, though. If anyone has experience with hyperparameter tuning in an adversarial setting and has a suggestion, I will be happy to try it out. Thanks!

