Reproducing Self Forcing DMD results on 8x H200 #73

@krakhit

Great work! I tried to reproduce the results on an 8x H200 setup, starting from the Self Forcing ODE checkpoint.

Config:

ODE checkpoint: the ODE checkpoint provided by Self Forcing.

issue:

I was unable to stabilize the critic loss, and the DMD loss was very noisy and often ended up on a strange plateau.

measurement:

Visual comparison using prompts both from the dataset and from outside it.

prompt:

"A Porsche, sleek and black, races swiftly along the asphalt. It weaves through the landscape against a backdrop of destroyed houses and skyscrapers cloaked in moss. As dawn breaks, the crimson sun ascends into the sky."

Self Forcing checkpoint

dmd_generation.mp4

my checkpoint

iter_000600_off_dset.mp4

training settings:

Adjusted for H200 (which can support a per-GPU batch size of 3):

lr: 2.0e-06
lr_critic: 1.0e-06
# Also tried 4.0e-07, 5.0e-07, 6.0e-07, and 2.0e-06 (everything else fixed); none shows a downward trend
beta1: 0.0
beta2: 0.999
beta1_critic: 0.0
beta2_critic: 0.999
batch_size: 2
gradient_accumulation_steps: 4  # Effective batch size = 2 × 8 GPUs × 4 = 64
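For reference, the effective batch size arithmetic in the config comment can be checked with a small sketch. The helper names below are hypothetical and not part of the Self Forcing codebase; the numbers come from the config above.

```python
from typing import Optional


def effective_batch_size(per_gpu_batch: int, num_gpus: int, accum_steps: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_gpu_batch * num_gpus * accum_steps


def accum_steps_for_target(target: int, per_gpu_batch: int, num_gpus: int) -> Optional[int]:
    """Accumulation steps needed to hit `target` exactly, or None if impossible."""
    per_micro_step = per_gpu_batch * num_gpus
    if target % per_micro_step != 0:
        return None
    return target // per_micro_step


# Values from the config above.
print(effective_batch_size(2, 8, 4))     # 2 * 8 * 4 = 64
print(accum_steps_for_target(64, 2, 8))  # 4
# With a per-GPU batch size of 3, an effective batch size of 64 is not reachable exactly.
print(accum_steps_for_target(64, 3, 8))  # None
```

This is presumably why `batch_size` stays at 2 even though a per-GPU batch of 3 fits in memory: 3 × 8 = 24 does not divide 64.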

generator loss

My DMD loss curve quickly gets stuck on a plateau; the thick line is the EMA (0.9).
Image
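For readers comparing their own curves, here is a minimal sketch of the kind of smoothing the thick line represents. It assumes the common convention `s_t = decay * s_{t-1} + (1 - decay) * x_t` with `decay = 0.9`. The exact formula used by the plotting tool is not stated in this issue, and `ema_smooth` is a hypothetical helper.

```python
def ema_smooth(values, decay=0.9):
    """Exponential moving average of a sequence, seeded with the first value."""
    smoothed = []
    running = None
    for v in values:
        # The first point has no history, so it is used as-is (an assumption;
        # some tools apply a bias correction instead).
        running = v if running is None else decay * running + (1.0 - decay) * v
        smoothed.append(running)
    return smoothed


# Hypothetical noisy loss values, only to show usage.
raw_loss = [1.0, 0.4, 0.9, 0.3, 0.8, 0.35, 0.75]
for raw, smooth in zip(raw_loss, ema_smooth(raw_loss)):
    print(f"raw={raw:.2f}  ema={smooth:.3f}")
```

With a decay of 0.9 the smoothed curve reacts slowly, so a plateau in the thick line reflects roughly the last ten or so logged points rather than any single step.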

critic loss

The critic seems to be struggling to keep up with the generator. I have tried increasing its learning rate from 4.0e-07 (the paper setting for DMD) all the way to 1.0e-06.
Image

generator grad norm

The generator grad norm stays at a low amplitude.
Image

current estimation

My current guess is that the differences between H200 and H100, combined with the 4 serial accumulation steps, cause compounding floating-point errors, which makes reproducing the DMD checkpoint much more difficult in the 8-GPU setting. A low critic LR also seems to make the critic loss climb quickly, forcing the model into the nearest "fake vacuum" at earlier time steps.

But I am not quite sure of this. If anyone has experience with hyperparameter tuning in an adversarial setting and has a suggestion, I will be happy to try it out. Thanks!
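One way to sanity-check the floating-point part of this guess is to measure how far a serially accumulated average drifts from a single-pass average. The sketch below is a toy with made-up scalar "gradients" in Python floats (float64), not the actual training code; it only shows the measurement, and all names in it are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the toy data is reproducible

NUM_SAMPLES = 64  # effective batch size from the config
ACCUM_STEPS = 4   # gradient_accumulation_steps from the config
MICRO = NUM_SAMPLES // ACCUM_STEPS

# Hypothetical per-sample gradient values for a single scalar parameter.
per_sample_grads = [random.uniform(-1.0, 1.0) for _ in range(NUM_SAMPLES)]

# Reference: average over the full batch in one pass.
single_pass = sum(per_sample_grads) / NUM_SAMPLES

# Accumulated: average each micro-batch, scale by 1/ACCUM_STEPS, add up serially.
accumulated = 0.0
for step in range(ACCUM_STEPS):
    micro_batch = per_sample_grads[step * MICRO:(step + 1) * MICRO]
    accumulated += (sum(micro_batch) / MICRO) / ACCUM_STEPS

drift = abs(single_pass - accumulated)
print(f"single pass: {single_pass:.17g}")
print(f"accumulated: {accumulated:.17g}")
print(f"drift:       {drift:.3e}")
```

In float64 the drift should be negligible. Whether lower-precision accumulation on real gradients produces a gap large enough to matter is something this toy does not answer; repeating the same comparison on actual gradient tensors would be needed for that.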
