I am using 2 A100 GPUs with 90GB VRAM.
During training, the process gets stuck in the fsdp_wrap call for real_score, which wraps the Wan2.1-T2V-14B model in FSDP, and then the job on my A100s exits with no error message or signal.
The code is below:
The run output is below. (I added [DEBUG] messages: when each of the generator, real_score, and fake_score is wrapped in FSDP, the corresponding message is printed.)
