Runtime Error when resuming training #521
Do you experience the same error when training on a single GPU, and then when resuming training on a single GPU?
Yes, when using a single GPU, the same Runtime Error occurs. If I comment out the line that restores the optimizer state (faster-rcnn.pytorch/trainval_net.py, line 283 at 0797f62), the saved optimizer state won't be loaded when resuming training, and that actually works: the Runtime Error never occurs again and training goes on. But I have no idea whether this is the right solution, whether it will affect the later training process, or what on earth caused this problem in the first place...
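For reference, here is a minimal, self-contained sketch of the workaround described above. The checkpoint file name, the `"model"`/`"optimizer"` keys, and the stand-in `nn.Linear` model are assumptions for illustration, not the exact code in trainval_net.py.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in model and optimizer; the real script builds a Faster R-CNN network.
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Save a checkpoint roughly the way trainval_net.py does (key names assumed).
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, "checkpoint.pth")

# --- resuming ---
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model"])

# Workaround from this thread: leave the optimizer restore commented out, so
# the optimizer starts from a fresh state and the RuntimeError is avoided.
# optimizer.load_state_dict(checkpoint["optimizer"])
```

The trade-off is that momentum buffers and any other per-parameter optimizer state are lost, so the first resumed iterations behave as if the optimizer were starting from scratch.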
I haven't been able to reproduce your issue. Could you please send me the errors you get for 1 GPU and for multiple GPUs? Maybe I can spot an abnormality.
Sure. The errors I get for 1 GPU and for multiple GPUs are the same, both as in my description above. The code snippet I'm using hasn't been modified at all; it is exactly the same as the master branch of this repo. Thanks~
Could you try to modify your code as suggested in this comment? It has been merged into the pytorch-1.0 branch, but not into the main branch. Maybe it will solve your problem as well.
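(The linked comment is not quoted in this thread. A common fix of this kind is to push the restored optimizer state onto the GPU right after `optimizer.load_state_dict(...)`; the sketch below assumes that, and the helper name is hypothetical.)

```python
import torch

def optimizer_state_to_cuda(optimizer):
    # Hypothetical helper: after optimizer.load_state_dict(...) on a CUDA run,
    # move every tensor in the optimizer's state onto the GPU so it matches
    # the device of the model parameters.
    for state in optimizer.state.values():
        for key, value in state.items():
            if isinstance(value, torch.Tensor):
                state[key] = value.cuda()
```

Called once after restoring the optimizer, this keeps the momentum buffers on the same device as the cuda model parameters, which is the kind of CPU/GPU mismatch that often triggers a RuntimeError when resuming.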
Thanks... I've tried that: I changed pytorch to 1.0, used the pytorch-1.0 branch, and applied the suggested change.
And you still get the same error? Everything should work pretty much out of the box: git pull and run. EDIT: What version of torchvision are you using?
Yes... if I just use the out-of-the-box code, everything works fine when training from scratch, either on a single GPU or on multiple GPUs, but the error always occurs when resuming training... But as I described before, if I comment out those two lines in trainval_net.py, resuming works. I'm using torchvision 0.2.1 (build py35_1).
You could try to update your torchvision. But if, as you said, you have no negative side effects when not loading the optimizer's state dict, you might as well resume the way you are currently doing.
Thanks a lot~ I'd try a higher torchvision version.
@HViktorTsoi, did you solve this problem?
Yes, I solved the problem this way.
Hi, I am having the same problem. I am trying to load a model that was trained on pytorch==1.2.0. When I load it in pytorch==1.6.0 and resume training, the same error occurs. Would loading an optimizer state that was saved under a different version be an issue?
Thanks! I've encountered the same issue and this solution works for me.
I was training on my own dataset using multiple GPUs, but when resuming training, I got this error:
Environment:
Pytorch 0.4.0
CUDA 9.0
cuDNN 7.1.2
Python 3.5
GPUs: 4 x Tesla V100
Command line I used:
CUDA_VISIBLE_DEVICES=2,3,4,5 python trainval_net.py --dataset virtual_sign_2019 --net vgg16 --bs 32 --nw 16 --lr 0.001 --cuda --mGPUs --r True --checksession 1 --checkepoch 3 --checkpoint 1124
I have tried everything I can to solve this problem, including many related issues like #515, #475, and #506, but the problem still exists... Is there any possible solution? Thanks...