Out of memory when training on the Campus dataset with a GTX 3080 Ti, even with batch=1 #27

Open
anzisheng opened this issue Apr 28, 2023 · 2 comments

Comments

@anzisheng

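My training environment is as follows:
python 3.7
torch 1.4
GPU: GTX 3080 Ti with 12 GB of memory.
To save GPU memory, I set the batch size to 1 and the SYNTHETIC NUM_DATA to 1000.
When I run the train.py provided by the author, it runs for a while in epoch 0 and then reports out of memory.
The error message is as follows: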
`Epoch: 0
Save the sampling grid in HDN for sequence synthetic
Epoch: [0][0/1000] Time: 563.691s (563.691s) Speed: 0.0 samples/s Data: 6.174s (6.174s) Loss: nan (nan) Loss_2d: 0.0008510 (0.0008510) Loss_1d: nan (nan) Loss_bbox: 0.012933 (0.012933) Loss_joint: nan (nan) Memory 292969472.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000000
Save the sampling grid in JLN for sequence synthetic
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000100
Epoch: [0][100/1000] Time: 0.078s (5.717s) Speed: 12.8 samples/s Data: 0.000s (0.064s) Loss: nan (nan) Loss_2d: 0.0008510 (nan) Loss_1d: nan (nan) Loss_bbox: 0.015552 (383651026093232355625853440229376.000000) Loss_joint: nan (nan) Memory 2886614528.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000200
Epoch: [0][200/1000] Time: 0.077s (2.913s) Speed: 13.0 samples/s Data: 0.000s (0.034s) Loss: nan (nan) Loss_2d: 0.0050989 (nan) Loss_1d: nan (nan) Loss_bbox: 0.039250 (1752363751526759131174129536860160.000000) Loss_joint: nan (nan) Memory 5246115328.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000300
Epoch: [0][300/1000] Time: 0.079s (1.973s) Speed: 12.7 samples/s Data: 0.000s (0.024s) Loss: nan (nan) Loss_2d: inf (nan) Loss_1d: nan (nan) Loss_bbox: 0.025086 (1170183103178998798672389530976256.000000) Loss_joint: nan (nan) Memory 7605616128.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
Epoch: [0][400/1000] Time: 0.078s (1.502s) Speed: 12.9 samples/s Data: 0.000s (0.019s) Loss: nan (nan) Loss_2d: 0.0042343 (nan) Loss_1d: nan (nan) Loss_bbox: 0.034341 (878474771048066240862024315699200.000000) Loss_joint: nan (nan) Memory 9965116928.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000400
Traceback (most recent call last):
  File "E:\project\Faster-VoxelPose_23_04_25\run\train.py", line 171, in <module>
    main()
  File "E:\project\Faster-VoxelPose_23_04_25\run\train.py", line 140, in main
    train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
  File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\core\function.py", line 45, in train_3d
    cameras=cameras, resize_transform=resize_transform)
  File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\parallel\data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\voxelpose.py", line 38, in forward
    bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
  File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\human_detection_net.py", line 81, in forward
    feature_cubes = self.project_layer(heatmaps, meta, cameras, resize_transform)
  File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\project_whole.py", line 84, in forward
    cubes[i] = torch.mean(F.grid_sample(heatmaps[i], shared_sample_grid, align_corners=True), dim=0).squeeze(0)
  File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\functional.py", line 2711, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 12.00 GiB total capacity; 11.21 GiB already allocated; 0 bytes free; 11.26 GiB reserved in total by PyTorch)

Process finished with exit code 1
`
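The `Memory` column in the log above keeps climbing even though the batch size is fixed at 1. A rough way to quantify that is sketched below. It assumes the `Memory` value is reported in bytes (the log does not say), and the helper names `memory_samples` and `growth_per_iteration` are made up for this illustration, not part of the repository. The progress lines are abbreviated to the iteration index and the `Memory` value.

```python
import re

# Progress lines abbreviated from the log above: only the iteration index
# and the "Memory" value are kept; everything else is replaced by "...".
LOG = """\
Epoch: [0][0/1000] ... Memory 292969472.0
Epoch: [0][100/1000] ... Memory 2886614528.0
Epoch: [0][200/1000] ... Memory 5246115328.0
Epoch: [0][300/1000] ... Memory 7605616128.0
Epoch: [0][400/1000] ... Memory 9965116928.0
"""

PATTERN = re.compile(r"Epoch: \[\d+\]\[(\d+)/\d+\].*?Memory (\d+(?:\.\d+)?)")


def memory_samples(log_text):
    """Return (iteration, memory) pairs found in the progress lines."""
    return [(int(it), float(mem)) for it, mem in PATTERN.findall(log_text)]


def growth_per_iteration(samples):
    """Memory gained per iteration between consecutive samples."""
    return [
        (m1 - m0) / (i1 - i0)
        for (i0, m0), (i1, m1) in zip(samples, samples[1:])
    ]


samples = memory_samples(LOG)
for rate in growth_per_iteration(samples):
    # Assumption: the logged value is in bytes, so divide by 2**20 for MiB.
    print(f"{rate / 2**20:.1f} MiB per iteration")
```

Under the bytes assumption, the growth is nearly constant at roughly 22 to 25 MiB per iteration. Memory that grows with the iteration count at a fixed batch size usually means something is being kept alive between iterations, which a smaller batch would not fix.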
Are there any other ways to reduce GPU memory consumption during training so that it fits on a 12 GB card?
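One thing that may be worth checking, offered as a guess and not a confirmed diagnosis: a common cause of steadily growing memory in PyTorch training loops is keeping loss tensors, and with them their autograd graphs, alive between iterations, for example inside running-average statistics. The log also shows NaN and Inf losses feeding into those averages. The sketch below is a hypothetical meter (`FiniteAverageMeter` is not a class from this repository) that stores only plain Python floats and skips non-finite values. Calling `float()` on a one-element tensor returns a Python number, so the meter holds no reference to the tensor.

```python
import math


class FiniteAverageMeter:
    """Running average that stores plain floats and skips NaN/Inf values."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0
        self.skipped = 0

    def update(self, value):
        # float() converts a one-element tensor to a Python number, so the
        # meter never keeps the tensor (or its autograd graph) alive.
        value = float(value)
        if not math.isfinite(value):
            self.skipped += 1
            return False  # the caller may choose to skip this batch
        self.sum += value
        self.count += 1
        return True

    @property
    def avg(self):
        # NaN signals "no finite samples yet" instead of dividing by zero.
        return self.sum / self.count if self.count else float("nan")


# Plain floats stand in for per-batch loss values here.
meter = FiniteAverageMeter()
for loss in [0.012933, float("nan"), 0.015552, float("inf")]:
    meter.update(loss)
print(meter.avg, meter.skipped)
```

The return value of `update` could be used to skip the optimizer step for a batch whose loss is not finite, though the NaN itself would still need to be traced to its source.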

@912267428

Same problem here. Have you found a solution?

@AlvinYH
Owner

AlvinYH commented Jul 23, 2023

Thanks for your interest in our work. We've updated the code, so please pull the latest release and try retraining the model. We ran our experiments on the Campus dataset with a batch size of 8, and it worked fine.
