
Training resumes not as expected #81

Open
Borshig opened this issue Mar 27, 2021 · 10 comments

Comments

@Borshig

Borshig commented Mar 27, 2021

Describe the bug
I use --resume <Path to .pkl file> to resume training after stopping,
and I noticed a few differences from StyleGAN2:
first, after resuming, from the first tick the network begins training as if it were learning for the first time. It looks like it knows nothing about the previous training.

To Reproduce
Steps to reproduce the behavior:

  1. In the root directory, run the command 'python train.py --outdir=./training-runs --data=./dataset/f256.zip --gpus=1 --resume=./training-runs/00012-f256-auto1-kimg10000-batch4-resumecustom/network-snapshot-000600.pkl --batch 4 --kimg 10000 --snap 4 --isnap 2 --nobench=True --metrics=rfid5k'

*Here I added the isnap argument for image snapshots, and the rfid5k metric as a reduced fid50k for my small dataset.

  2. I assumed it took the kimg count from the file name, but I began to doubt this when, after initialization, the counter was reset to zero and the metrics shot up, e.g. from 125 to 336: https://ibb.co/2sT8zY8
    In StyleGAN2 I could set the iteration count in kimg, and it resumed fine.

I found logs from StyleGAN2; after resuming, the metrics are fine there: https://ibb.co/xS52vDC

Expected behavior
I expected resuming to work as in StyleGAN2; I don't want to retrain from the beginning.
Maybe there is a value I can set to resume from, or maybe this is already implemented but not documented.

Sorry if I have offended anyone.

Desktop (please complete the following information):

  • OS: Windows 10
  • PyTorch version: 1.7.1
  • CUDA toolkit version: CUDA 10.0
  • NVIDIA driver version: 461.09
  • GPU: RTX2070
  • Docker: No.

Additional context
Maybe it's part of how ADA works?
I think you can help me.
P.S. Thank you for your work; the PyTorch implementation must have taken a lot of effort.
ADA is AWESOME.

@Borshig

Borshig commented Apr 11, 2021

@nurpax I see you are actively answering questions here; can you help me?
Maybe I should use additional arguments?

@jpkos

jpkos commented Apr 14, 2021

What augment values are you getting? I had similar problems after resuming training: the output images were distorted (rotated and with weird colors). At first I thought this was caused by training resuming from kimg = 0, as you mentioned, so I changed the code to read the initial kimg value from the .pkl file name. Then I noticed that the augmentation parameter was increasing without limit, which probably caused the augmentations to leak into the output images, producing the distortions. When I switched to fixed augmentation, the problem went away.
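For example, a fixed-strength run can be launched with the existing --aug and --p options of train.py (the 0.2 here is just an illustrative value, not a recommendation):

```
python train.py --outdir=./training-runs --data=./dataset/f256.zip --gpus=1 \
    --resume=./training-runs/00012-f256-auto1-kimg10000-batch4-resumecustom/network-snapshot-000600.pkl \
    --aug=fixed --p=0.2
```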

@Borshig

Borshig commented Apr 14, 2021

@jpkos, yeah, same problems with the augmentation value. I tried fixed augmentation, but my model didn't train, or trained very slowly (maybe because I have a small dataset). I adjusted the augmentation parameters and turned off rotate90 and lumaflip. The images stopped rotating, but after 600 kimg they became very green and bright. It looks like the problem is the endlessly increasing augmentation value. I don't know how NVlabs did it; I don't have comparably powerful GPUs to run many tests and reproduce their results.

@woctezuma

> I use --resume <Path to .pkl file> to resume learning after stopping.

--resume is for transfer learning, not for resuming training in a stop-and-go fashion.
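You can see this in training_loop.py: --resume only copies the network weights into the freshly built networks, while the counters start over. A paraphrased sketch of the relevant upstream logic (not the verbatim code):

```python
# Paraphrased from training_loop.py in stylegan2-ada-pytorch:
# only the network parameters are restored from the checkpoint.
if resume_pkl is not None:
    with dnnlib.util.open_url(resume_pkl) as f:
        resume_data = legacy.load_network_pkl(f)
    for name, module in [('G', G), ('D', D), ('G_ema', G_ema)]:
        misc.copy_params_and_buffers(resume_data[name], module, require_all=False)

cur_nimg = 0  # the kimg counter restarts at zero regardless of the checkpoint
# the ADA controller likewise re-tunes augment_pipe.p from its initial value
```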

@Borshig

Borshig commented May 1, 2021

> I use --resume <path to .pkl file> to resume training after stopping.
>
> --resume is intended for transfer learning, not for resuming training in a stop-and-go fashion.

@woctezuma and what do you use to resume?

@woctezuma

I use --resume latest from #3 to provide initial values for augmentation strength and current kimg.
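In spirit, that helper scans the output directory for the newest snapshot and derives the kimg counter from its file name. A rough sketch of the idea (not the exact code from #3; the function name is made up):

```python
# Hypothetical sketch of a "latest" resume helper: pick the newest
# network-snapshot-*.pkl under the output directory and recover its kimg.
import glob
import os
import re

def find_latest_snapshot(outdir):
    """Return (path, kimg) of the most recent snapshot, or (None, 0)."""
    # Lexicographic sort works because run directories and snapshot names
    # are zero-padded (e.g. 00012-.../network-snapshot-000600.pkl).
    snapshots = sorted(glob.glob(os.path.join(outdir, '*/network-snapshot-*.pkl')))
    if not snapshots:
        return None, 0
    latest = snapshots[-1]
    kimg = int(re.search(r'network-snapshot-(\d+)\.pkl', latest).group(1))
    return latest, kimg
```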

@straeter

straeter commented Dec 9, 2021

As I understand it, StyleGAN2 (the previous version) had a "resume_kimg" option in training_loop.py which was not actually exposed by train.py.

You can easily implement this yourself: add resume_kimg as a parameter to train.py (with default 0), pass it through to training_loop.py, and there set cur_nimg = int(resume_kimg * 1000) instead of cur_nimg = 0; see the sketch below.
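A minimal runnable sketch of that change (the resume_kimg option is the addition; the function bodies here are stubs that only show where the change goes in the real train.py and training/training_loop.py):

```python
# Sketch of the resume_kimg change against stylegan2-ada-pytorch.
import click

def training_loop(resume_kimg=0, **kwargs):
    # training/training_loop.py: upstream initializes cur_nimg = 0.
    # Starting from the resumed value keeps ticks, snapshot file names,
    # and the kimg-based schedules consistent with the previous run.
    cur_nimg = int(resume_kimg * 1000)
    print(f'Starting counters at {cur_nimg // 1000} kimg')

@click.command()
@click.option('--resume_kimg', type=int, default=0,
              help='Kimg counter to start from when resuming (added option).')
def main(resume_kimg):
    # train.py: forward the new option to the loop like the existing options.
    training_loop(resume_kimg=resume_kimg)

if __name__ == '__main__':
    main()
```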

@woctezuma

Your fix is what was done here: 64efea2

@whyydsforever

whyydsforever commented Mar 8, 2022

@woctezuma
What about the "resume_pkl" argument in training_loop.py? Could I just set "resume_pkl" to my .pkl file and keep the same train.py command as before?

@dookiethedog

My GAN crashed and I was extremely annoyed, since I was experiencing the exact same issue, so I decided to read into the code. Setting the initial augmentation and kimg does not actually continue training from where it last stopped. The devs don't seem to care about crashes, as there is no proper resume code. I was able to modify the code and create a working resume function; however, I won't be able to resume my first GAN, since my code wasn't in place yet, so there is no way to pull the settings needed. At least for future runs I'll be fine, with everything stored in the pickle file.
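For anyone attempting the same, here is a hypothetical sketch of the idea of storing the extra state in the snapshot (the actual code is not shown in this thread, and the added field names are made up):

```python
# Hypothetical: extend the snapshot dict in training_loop.py so a later run
# can restore the full training state. Upstream stores only the networks,
# augment_pipe, and training_set_kwargs.
snapshot_data = dict(
    G=G, D=D, G_ema=G_ema, augment_pipe=augment_pipe,
    training_set_kwargs=dict(training_set_kwargs),
    # added (hypothetical) fields:
    cur_nimg=cur_nimg,                                    # image counter
    augment_p=float(augment_pipe.p) if augment_pipe is not None else 0.0,
    optimizer_states={phase.name: phase.opt.state_dict() for phase in phases},
)
# On resume, read these back instead of starting the counters from zero.
```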
