Reproducing Self Forcing DMD results on 8x H200
Great work! I tried to reproduce the results on an 8x H200 setup, starting from the Self Forcing ODE checkpoint.
Config:
ODE checkpoint: the ODE checkpoint provided by Self Forcing.
issue:
I was unable to stabilize the critic loss, and the DMD loss was very noisy and often ended up in a strange plateau.
measurement:
visual comparison on both off-dataset prompts and prompts from the dataset
prompt:
"A Porsche, sleek and black, races swiftly along the asphalt. It weaves through the landscape against a backdrop of destroyed houses and skyscrapers cloaked in moss. As dawn breaks, the crimson sun ascends into the sky."
Self Forcing checkpoint
dmd_generation.mp4
my checkpoint
iter_000600_off_dset.mp4
training settings:
adjusted for H200 (which can support a per-GPU batch size of 3)
lr: 2.0e-06
lr_critic: 1.0e-06
# 4.0e-07, then 5.0e-07, then 6.0e-07, and 2.0e-06 don't show a downward trend (keeping everything else fixed)
beta1: 0.0
beta2: 0.999
beta1_critic: 0.0
beta2_critic: 0.999
batch_size: 2
gradient_accumulation_steps: 4 # Effective batch size = 2 × 8 GPUs × 4 = 64
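The effective batch size in the comment above can be checked with a small helper. This is only an arithmetic sketch: the function name `effective_batch_size` is hypothetical and not part of the Self Forcing codebase, and the second call simply shows what a per-GPU batch size of 3 would give if the accumulation steps stayed at 4.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to one optimizer step (hypothetical helper)."""
    return per_gpu_batch * num_gpus * grad_accum_steps


# Settings from the config above: batch_size=2, 8 GPUs, 4 accumulation steps.
current = effective_batch_size(per_gpu_batch=2, num_gpus=8, grad_accum_steps=4)

# Assumption for illustration only: per-GPU batch size 3, same accumulation steps.
with_batch_3 = effective_batch_size(per_gpu_batch=3, num_gpus=8, grad_accum_steps=4)

print(current, with_batch_3)
```

Note that with 8 GPUs and a per-GPU batch size of 3, no integer number of accumulation steps gives exactly 64, which may be why batch size 2 is used here.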
generator loss
My DMD loss curve quickly gets stuck in a plateau; the thick line is an EMA (0.9).
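For reference, an EMA with factor 0.9 over a logged loss series can be sketched as below. This assumes the plain recursive form without bias correction; the plotting tool used for the curves may apply a different (for example debiased) variant, and `ema_smooth` is a hypothetical name.

```python
def ema_smooth(values, decay=0.9):
    """Exponential moving average: s_t = decay * s_{t-1} + (1 - decay) * x_t.

    The first smoothed value is initialised to the first raw value
    (an assumption; some tools initialise to zero and debias instead).
    """
    smoothed = []
    running = None
    for v in values:
        running = v if running is None else decay * running + (1.0 - decay) * v
        smoothed.append(running)
    return smoothed


# A noisy series that sits on a plateau around 1.0.
raw_loss = [1.3, 0.7, 1.2, 0.8, 1.1, 0.9]
print(ema_smooth(raw_loss))
```

With a decay of 0.9, each new point contributes only 10% to the smoothed curve, so a short-lived change in the raw loss is strongly damped in the thick line.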

critic loss
The critic seems to be struggling to keep up with the generator. I have tried increasing the learning rate from 4.0e-07 (the paper setting for DMD) all the way to 1.0e-06.

generator grad norm
The generator grad norm stays at a low amplitude.

current hypothesis
My guess is that the differences between H200 and H100, together with the 4 serial accumulation steps, cause compounding floating-point errors, which makes reproducing the DMD checkpoint much more difficult in the 8-GPU setting. A low critic LR also causes it to climb quickly, forcing the model into the nearest fake vacuum at earlier time steps.
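As a small illustration of the floating-point part of this guess, the sketch below shows that summing the same values sequentially and with exactly rounded summation can give different results in double precision. It only demonstrates that accumulation order and rounding matter at all; it does not show that this effect is large enough to explain the training behaviour described here, and `sequential_sum` is a hypothetical helper.

```python
import math


def sequential_sum(values):
    """Add values one at a time, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1] * 10

naive = sequential_sum(values)   # rounding error accumulates step by step
exact = math.fsum(values)        # tracks partial sums to avoid that error

print(naive, exact, naive == exact)
```

In mixed-precision training the per-step rounding is much coarser than in this float64 example, but whether it compounds enough to matter would need to be checked directly, for example by comparing runs with different accumulation settings at a fixed effective batch size.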
But I am not sure about this. If anyone has experience with hyperparameter tuning in an adversarial setting and has a suggestion, I will be happy to try it out. Thanks!