training problem #15

Open
Whiplash-18 opened this issue Dec 5, 2022 · 4 comments
@Whiplash-18

I ran into the following problem when training the model on the Panoptic dataset. I am using torch 1.13 and CUDA 11.8.
File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
main()
File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 41, in train_3d
final_poses, poses, proposal_centers, loss_dict, input_heatmap = model(views=inputs, meta=meta, targets=targets,
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/voxelpose.py", line 38, in forward
bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/human_detection_net.py", line 94, in forward
proposal_heatmaps_1d = self.c2c_net(torch.flatten(feature_1d, 0, 1)).view(batch_size, self.max_people, -1)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/faster_voxel_pose/run/../lib/models/cnns_1d.py", line 131, in forward
hm = self.output_hm(x)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/workspace/faster_voxel_pose/run/train.py", line 181, in <module>
main()
File "/workspace/faster_voxel_pose/run/train.py", line 151, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "/workspace/faster_voxel_pose/run/../lib/core/function.py", line 71, in train_3d
accu_loss.backward()
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/conda/envs/faster_voxel_pose/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1]] is at version 7; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
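
For anyone else hitting this message: one common way to trigger it in newer PyTorch releases is to call `optimizer.step()` between two `backward()` calls that share the same graph. The step updates the weights in place, so the second backward finds saved tensors at a newer version than it expects. The sketch below reproduces that pattern with small stand-in layers. The names `shared`, `head`, `loss_a` and `loss_b` are hypothetical, and this is not a claim about what `function.py` actually does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for two stages of a model; not the repository's modules.
shared = nn.Linear(4, 4)
head = nn.Linear(4, 1)

# A single optimizer that owns every parameter.
opt = torch.optim.SGD(list(shared.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(8, 4)
out = head(torch.relu(shared(x)))
loss_a = out.mean()
loss_b = (out ** 2).mean()

opt.zero_grad()
loss_a.backward(retain_graph=True)  # keep the graph for the second backward
opt.step()                          # in-place weight update bumps tensor versions

try:
    # The graph still references the weights as they were before opt.step().
    loss_b.backward()
    raised = False
except RuntimeError as err:
    raised = "inplace operation" in str(err)

print("second backward raised the in-place error:", raised)
```

If your training loop has this shape, the usual options are to run every `backward()` before the first `step()`, or to split the parameters so that a step never touches weights that a later backward still needs.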

@cucdengjunli

same question

@cucdengjunli

Maybe you need to use a V100.

@gpastal24

@Whiplash-18 I had the same problem. You have to use torch 1.4 to train the models, so you will need a GPU that supports CUDA 10.x.

@AlvinYH
Owner

AlvinYH commented Jul 23, 2023

Hi, @Whiplash-18. Thanks for your interest in our work. Yes, there was a bug in our earlier implementation. We solved it by using two separate optimizers to train HDN and JLN, respectively. We've revised the code, so you can pull the latest release. It now supports newer PyTorch versions (>1.4).
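
To illustrate the two-optimizer idea, here is a minimal sketch with toy layers. It is not the code from the release: `hdn` and `jln` are hypothetical `nn.Linear` stand-ins, the data and targets are random, and the `detach()` between the stages is an assumption made here so that the second backward never walks into weights the first optimizer has already updated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the two sub-networks (not the real HDN/JLN).
hdn = nn.Linear(4, 4)
jln = nn.Linear(4, 1)

# One optimizer per sub-network, so each step only touches its own weights.
opt_hdn = torch.optim.Adam(hdn.parameters(), lr=1e-3)
opt_jln = torch.optim.Adam(jln.parameters(), lr=1e-3)

# Random toy data, for illustration only.
x = torch.randn(8, 4)
target_hdn = torch.randn(8, 4)
target_jln = torch.randn(8, 1)


def train_step():
    # Stage 1: update the first network on its own loss.
    feats = hdn(x)
    loss_hdn = F.mse_loss(feats, target_hdn)
    opt_hdn.zero_grad()
    loss_hdn.backward()
    opt_hdn.step()

    # Stage 2: detach the features (assumption) so this graph stops at jln
    # and does not depend on hdn weights that were just changed in place.
    preds = jln(feats.detach())
    loss_jln = F.mse_loss(preds, target_jln)
    opt_jln.zero_grad()
    loss_jln.backward()
    opt_jln.step()

    return loss_hdn.item(), loss_jln.item()


for step in range(3):
    print(step, train_step())
```

With this layout each `backward()` only reads weights that have not been stepped since its forward pass, which is the condition the autograd version check enforces.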
