training problem #15

Whiplash-18 · 2022-12-05T07:46:06Z

I hit the following error while training the model on the Panoptic dataset. I am using torch 1.13 and CUDA 11.8.
File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
main()
File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 41, in train_3d
final_poses, poses, proposal_centers, loss_dict, input_heatmap = model(views=inputs, meta=meta, targets=targets,
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/voxelpose.py", line 38, in forward
bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/human_detection_net.py", line 94, in forward
proposal_heatmaps_1d = self.c2c_net(torch.flatten(feature_1d, 0, 1)).view(batch_size, self.max_people, -1)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/cnns_1d.py", line 131, in forward
hm = self.output_hm(x)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
main()
File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 71, in train_3d
accu_loss.backward()
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1]] is at version 7; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
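
For context, the "is at version 7; expected version 5" part of the message comes from autograd's version check: every tensor carries a counter that is bumped by in-place writes, and the backward pass refuses to use a saved tensor whose counter has moved since the forward pass. The sketch below is a minimal standalone illustration of that mechanism, not code from this repository; the function name `trigger_inplace_error` is made up for the example.

```python
import torch


def trigger_inplace_error():
    """Return (version_before, version_after, error_message_or_None)."""
    w = torch.ones(3, requires_grad=True)
    y = (w * w).sum()            # the multiplication saves w for the backward pass
    before = w._version          # version counter recorded at forward time
    with torch.no_grad():
        w.add_(1.0)              # in-place write, similar to what an optimizer step does
    after = w._version           # the counter has been bumped by the in-place write
    try:
        y.backward()             # autograd notices the saved tensor is stale
    except RuntimeError as err:
        return before, after, str(err)
    return before, after, None
```

If this matches what happens in training, the usual suspects are an optimizer step or another in-place update that runs between a forward pass and a later `backward()` on the same graph.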

cucdengjunli · 2022-12-16T11:50:12Z

Same problem here.

cucdengjunli · 2022-12-16T12:22:40Z

Maybe you need to use a V100.

gpastal24 · 2023-01-02T21:05:20Z

@Whiplash-18 I had the same problem. You have to use torch 1.4 to train the models, so you will need a GPU that supports CUDA 10.x.

AlvinYH · 2023-07-23T16:07:45Z

Hi @Whiplash-18, thanks for your interest in our work. Yes, there was a bug in our earlier implementation. We fixed it by using two separate optimizers to train HDN and JLN. We've revised the code, so you can pull the latest release. It now supports newer PyTorch versions (>1.4).
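
For readers who want to see the general shape of a two-optimizer setup, here is a rough sketch. It is not the revised repository code: the `nn.Linear` modules standing in for HDN and JLN, the placeholder losses, the `train_step` name, and the choice to detach the first stage's output are all assumptions made for illustration.

```python
import torch
from torch import nn


def train_step(hdn, jln, opt_hdn, opt_jln, x):
    """One training step with a separate optimizer per sub-network."""
    proposals = hdn(x)
    loss_hdn = proposals.pow(2).mean()        # placeholder loss for the first stage

    # Assumption: the second stage consumes detached proposals, so its graph
    # does not reach back into the first stage's parameters.
    poses = jln(proposals.detach())
    loss_jln = poses.pow(2).mean()            # placeholder loss for the second stage

    opt_hdn.zero_grad()
    loss_hdn.backward()
    opt_hdn.step()                            # updates only hdn's parameters

    opt_jln.zero_grad()
    loss_jln.backward()                       # this graph only saved jln's parameters
    opt_jln.step()
    return loss_hdn.item(), loss_jln.item()


# Hypothetical stand-ins for the two sub-networks.
hdn = nn.Linear(4, 4)
jln = nn.Linear(4, 1)
opt_hdn = torch.optim.SGD(hdn.parameters(), lr=0.1)
opt_jln = torch.optim.SGD(jln.parameters(), lr=0.1)
losses = train_step(hdn, jln, opt_hdn, opt_jln, torch.randn(2, 4))
```

In this layout each `step()` only writes to parameters whose graph has already been backpropagated or never saved them, so the version check is not tripped.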
