
Runtime Error when resuming training #521

Open
HViktorTsoi opened this issue Apr 25, 2019 · 14 comments
@HViktorTsoi

HViktorTsoi commented Apr 25, 2019

I was training on my own dataset using multiple GPUs, but when resuming training I got this error:

Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
  File "trainval_net.py", line 340, in <module>
    optimizer.step()
  File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3

Environment:
PyTorch 0.4.0
CUDA 9.0
cuDNN 7.1.2
Python 3.5
GPUs: 4 x Tesla V100

Command line I used:

CUDA_VISIBLE_DEVICES=2,3,4,5 python trainval_net.py --dataset virtual_sign_2019 --net vgg16 --bs 32 --nw 16 --lr 0.001 --cuda --mGPUs --r True --checksession 1 --checkepoch 3 --checkpoint 1124

I have tried everything I can to solve this problem, including many related issues like #515 #475 #506, but the problem still exists... Is there any possible solution? Thanks.

@HViktorTsoi HViktorTsoi changed the title Runtime Error when resuming training Runtime Error when resuming training using muitiple GPUs Apr 25, 2019
@HViktorTsoi HViktorTsoi changed the title Runtime Error when resuming training using muitiple GPUs Runtime Error when resuming training Apr 25, 2019
@AlexanderHustinx

Do you experience the same error when training on a single GPU, and then when resuming training on a single GPU?

@HViktorTsoi
Author

HViktorTsoi commented Apr 29, 2019

Do you experience the same error when training on a single GPU, and then when resuming training on a single GPU?

Yes, the same RuntimeError occurs when using a single GPU.
I guess it's caused by
optimizer.load_state_dict(checkpoint['optimizer'])
in trainval_net.py when resuming training, because the error message points to optimizer.step() every time. I tried commenting out these two lines in trainval_net.py:

optimizer.load_state_dict(checkpoint['optimizer'])
lr = optimizer.param_groups[0]['lr']

so that they become:

# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']

which means the saved optimizer state won't be loaded when resuming training. That actually works: the RuntimeError never occurs again and training continues. But I have no idea whether this is the right solution, whether it will affect the later training process, or what caused the problem in the first place.
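
To check whether the loaded momentum buffers really end up attached to the wrong parameters, something like the following could be run right after torch.load (just a sketch of mine, not code from the repo; it mirrors the positional mapping that optimizer.load_state_dict uses between saved param ids and current params, and relies on SGD storing its state under the momentum_buffer key):

    # hypothetical sanity check: compare each current parameter's shape with the
    # shape of the momentum buffer the checkpoint would assign to it
    ckpt_opt = checkpoint['optimizer']
    saved_ids = [pid for g in ckpt_opt['param_groups'] for pid in g['params']]
    current_params = [p for g in optimizer.param_groups for p in g['params']]
    for pid, p in zip(saved_ids, current_params):
        buf = ckpt_opt['state'].get(pid, {}).get('momentum_buffer')
        if buf is not None and buf.shape != p.shape:
            print('shape mismatch:', tuple(buf.shape), 'vs', tuple(p.shape))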

@AlexanderHustinx

AlexanderHustinx commented May 3, 2019

I haven't been able to recreate your issue. Could you please send me the errors you get for 1 GPU and for multiple GPUs?
Can you send me a snippet of the code you're using, from
fasterRCNN.create_architecture()
to ...

  if args.use_tfboard:
    from tensorboardX import SummaryWriter
    logger = SummaryWriter("logs")

Maybe I can spot an abnormality.

@HViktorTsoi
Author

Sure. The errors I get for 1 GPU and for multiple GPUs are the same, and both are exactly as described above:


Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
  File "trainval_net.py", line 340, in <module>
    optimizer.step()
  File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3

The code snippet I'm using is:

    fasterRCNN.create_architecture()

    lr = cfg.TRAIN.LEARNING_RATE
    lr = args.lr
    # tr_momentum = cfg.TRAIN.MOMENTUM
    # tr_momentum = args.momentum

    params = []
    for key, value in dict(fasterRCNN.named_parameters()).items():
        if value.requires_grad:
            if 'bias' in key:
                params += [{'params': [value], 'lr': lr * (cfg.TRAIN.DOUBLE_BIAS + 1), \
                            'weight_decay': cfg.TRAIN.BIAS_DECAY and cfg.TRAIN.WEIGHT_DECAY or 0}]
            else:
                params += [{'params': [value], 'lr': lr, 'weight_decay': cfg.TRAIN.WEIGHT_DECAY}]

    if args.optimizer == "adam":
        lr = lr * 0.1
        optimizer = torch.optim.Adam(params)

    elif args.optimizer == "sgd":
        optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

    if args.cuda:
        fasterRCNN.cuda()

    if args.resume:
        load_name = os.path.join(output_dir,
                                 'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
        print("loading checkpoint %s" % (load_name))
        checkpoint = torch.load(load_name)
        args.session = checkpoint['session']
        args.start_epoch = checkpoint['epoch']
        fasterRCNN.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        lr = optimizer.param_groups[0]['lr']
        if 'pooling_mode' in checkpoint.keys():
            cfg.POOLING_MODE = checkpoint['pooling_mode']
        print("loaded checkpoint %s" % (load_name))

    if args.mGPUs:
        fasterRCNN = nn.DataParallel(fasterRCNN)

    iters_per_epoch = int(train_size / args.batch_size)

    if args.use_tfboard:
        from tensorboardX import SummaryWriter

        logger = SummaryWriter("logs")

By the way, the code wasn't modified at all; it's exactly the same as the master branch of this repo. Thanks~

@AlexanderHustinx

Could you try and modify your code as suggested in this comment

It has been merged into the pytorch-1.0 branch, but not the main branch. Maybe it will solve your problem as well.

@HViktorTsoi
Author

HViktorTsoi commented May 6, 2019

Could you try and modify your code as suggested in this comment

It has been merged into the pytorch-1.0 branch, but not the main branch. Maybe it will solve your problem as well.

Thanks... I've tried that: I switched to PyTorch 1.0 and the pytorch-1.0 branch, and also moved
if args.cuda: fasterRCNN.cuda()
above the construction of the optimizer (which is already done in the pytorch-1.0 branch), but when resuming training the problem still exists...
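
Concretely, the ordering I ended up with looks roughly like this (a sketch based on the snippet I posted above, mirroring the pytorch-1.0 branch; lr, cfg and args are the same variables as before):

    fasterRCNN.create_architecture()

    # move the model to the GPU before collecting parameters and building the
    # optimizer, as the pytorch-1.0 branch does
    if args.cuda:
        fasterRCNN.cuda()

    params = []
    for key, value in dict(fasterRCNN.named_parameters()).items():
        if value.requires_grad:
            if 'bias' in key:
                params += [{'params': [value], 'lr': lr * (cfg.TRAIN.DOUBLE_BIAS + 1),
                            'weight_decay': cfg.TRAIN.BIAS_DECAY and cfg.TRAIN.WEIGHT_DECAY or 0}]
            else:
                params += [{'params': [value], 'lr': lr, 'weight_decay': cfg.TRAIN.WEIGHT_DECAY}]

    if args.optimizer == "sgd":
        optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

    # the resume block (torch.load / load_state_dict) stays unchanged below this point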

@AlexanderHustinx

AlexanderHustinx commented May 6, 2019

And you still get the same error?

Everything should work pretty much out of the box: git pull and run.
As a sanity check, have you tried simply running everything normally first (no resuming, single GPU, etc.), and then resuming with default parameters, steadily working your way up to the full version of what you want to run?

EDIT: What version of torchvision are you using?

@HViktorTsoi
Author

HViktorTsoi commented May 7, 2019

Yes... if I just use the out-of-the-box code, everything works fine when training from scratch, on either a single GPU or multiple GPUs, but the error always occurs when resuming training...

But as I described before, if I comment out these two lines

# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']

in trainval_net.py when resuming training, the training process goes on normally. I've tested the modified code on my own dataset: the loss converges normally and the mAP on the test set is also acceptable.

I'm using the SGD optimizer, so it seems there is no adverse effect so far from not loading the optimizer's state dict when resuming. Whether it has any negative effect with other optimizers like Adam remains to be verified.

I'm using torchvision 0.2.1 (build py35_1).
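
One caveat with this workaround: since lr is no longer read back from the optimizer state, an already-decayed learning rate has to be reconstructed by hand when resuming in the middle of a schedule. A minimal sketch of what I mean (my own code, assuming a decay of lr_decay_gamma every lr_decay_step epochs as in the default trainval_net.py arguments; adapt it if your schedule differs):

    # rebuild the decayed lr instead of reading it from checkpoint['optimizer']
    num_decays = (args.start_epoch - 1) // args.lr_decay_step
    lr = args.lr * (args.lr_decay_gamma ** num_decays)
    for param_group in optimizer.param_groups:
        # keep each group's ratio to the base lr (e.g. the doubled lr on bias terms)
        param_group['lr'] = lr * (param_group['lr'] / args.lr)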


@AlexanderHustinx

AlexanderHustinx commented May 7, 2019

You could try updating your torchvision version; I read on a different repo that it might help.

But if, as you said, you have no negative side effects yet from not loading the optimizer's state dict, you might as well keep resuming the way you are currently doing.
Sorry I couldn't help you fix the problem.

@HViktorTsoi
Author

You could try updating your torchvision version; I read on a different repo that it might help.

But if, as you said, you have no negative side effects yet from not loading the optimizer's state dict, you might as well keep resuming the way you are currently doing.
Sorry I couldn't help you fix the problem.

Thanks a lot~ I'll try a newer torchvision version.

@H-YunHui

@HViktorTsoi
I'm getting the same error as you. Have you solved it?

@HViktorTsoi
Author

HViktorTsoi commented Sep 17, 2019

@HViktorTsoi
I'm getting the same error as you. Have you solved it?

Yes, I solved the problem with this:
#521 (comment)
It doesn't seem to have had any side effects after long-term use.

@YangJae96

Hi, I'm having the same problem.

I'm trying to load a model that was trained on pytorch==1.2.0.

When I load the model on pytorch==1.6.0 and resume training, training gets corrupted right after optimizer.step() is called.

Could loading an optimizer state that was saved on a different PyTorch version be an issue?

@syr-cn

syr-cn commented Jun 27, 2022

Thanks! I've encountered the same issue, and the solution above (commenting out optimizer.load_state_dict(checkpoint['optimizer']) and lr = optimizer.param_groups[0]['lr'] when resuming) works for me.
