Out of memory when training on the Campus dataset with a GTX 3080 Ti, even with batch=1 #27

anzisheng · 2023-04-28T02:43:08Z

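My training environment is as follows:
python 3.7
torch 1.4
GPU: GTX 3080 Ti, 12 GB of VRAM.
To save GPU memory, I set the batch size to 1 and NUM_DATA under SYNTHETIC to 1000.
When I run the train.py provided by the authors, it fails with out of memory shortly into epoch 0.
The error output is as follows: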
`Epoch: 0
Save the sampling grid in HDN for sequence synthetic
Epoch: [0][0/1000] Time: 563.691s (563.691s) Speed: 0.0 samples/s Data: 6.174s (6.174s) Loss: nan (nan) Loss_2d: 0.0008510 (0.0008510) Loss_1d: nan (nan) Loss_bbox: 0.012933 (0.012933) Loss_joint: nan (nan) Memory 292969472.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000000
Save the sampling grid in JLN for sequence synthetic
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000100
Epoch: [0][100/1000] Time: 0.078s (5.717s) Speed: 12.8 samples/s Data: 0.000s (0.064s) Loss: nan (nan) Loss_2d: 0.0008510 (nan) Loss_1d: nan (nan) Loss_bbox: 0.015552 (383651026093232355625853440229376.000000) Loss_joint: nan (nan) Memory 2886614528.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000200
Epoch: [0][200/1000] Time: 0.077s (2.913s) Speed: 13.0 samples/s Data: 0.000s (0.034s) Loss: nan (nan) Loss_2d: 0.0050989 (nan) Loss_1d: nan (nan) Loss_bbox: 0.039250 (1752363751526759131174129536860160.000000) Loss_joint: nan (nan) Memory 5246115328.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000300
Epoch: [0][300/1000] Time: 0.079s (1.973s) Speed: 12.7 samples/s Data: 0.000s (0.024s) Loss: nan (nan) Loss_2d: inf (nan) Loss_1d: nan (nan) Loss_bbox: 0.025086 (1170183103178998798672389530976256.000000) Loss_joint: nan (nan) Memory 7605616128.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
Epoch: [0][400/1000] Time: 0.078s (1.502s) Speed: 12.9 samples/s Data: 0.000s (0.019s) Loss: nan (nan) Loss_2d: 0.0042343 (nan) Loss_1d: nan (nan) Loss_bbox: 0.034341 (878474771048066240862024315699200.000000) Loss_joint: nan (nan) Memory 9965116928.0
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
helloo E:\project\Faster-VoxelPose_23_04_25\output\campus\voxelpose_50\jln64\train_00000400
Traceback (most recent call last):
File "E:\project\Faster-VoxelPose_23_04_25\run\train.py", line 171, in <module>
main()
File "E:\project\Faster-VoxelPose_23_04_25\run\train.py", line 140, in main
train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\core\function.py", line 45, in train_3d
cameras=cameras, resize_transform=resize_transform)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\parallel\data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\voxelpose.py", line 38, in forward
bbox_preds = self.pose_net(input_heatmaps, meta, cameras, resize_transform)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\human_detection_net.py", line 81, in forward
feature_cubes = self.project_layer(heatmaps, meta, cameras, resize_transform)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "E:\project\Faster-VoxelPose_23_04_25\run\..\lib\models\project_whole.py", line 84, in forward
cubes[i] = torch.mean(F.grid_sample(heatmaps[i], shared_sample_grid, align_corners=True), dim=0).squeeze(0)
File "E:\env\py37_pt140_cu101\lib\site-packages\torch\nn\functional.py", line 2711, in grid_sample
return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 12.00 GiB total capacity; 11.21 GiB already allocated; 0 bytes free; 11.26 GiB reserved in total by PyTorch)

Process finished with exit code 1
`
Does anyone know of other ways to reduce GPU memory consumption during training so that it fits in 12 GB of VRAM?
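
For anyone reading the log above, it may help to look at the `Memory` column as a trend rather than a single number. The sketch below is only an illustration: it assumes the `Memory` value is reported in bytes (the log does not say so explicitly), the log lines are abbreviated with `...`, and the names `LOG_EXCERPT`, `parse_memory`, and `growth_per_iteration` are made up for this example and are not part of the repository.

```python
import re

# Abbreviated copies of the progress lines from the log above.
# Only the iteration index and the trailing "Memory" value are kept.
LOG_EXCERPT = """
Epoch: [0][0/1000] ... Memory 292969472.0
Epoch: [0][100/1000] ... Memory 2886614528.0
Epoch: [0][200/1000] ... Memory 5246115328.0
Epoch: [0][300/1000] ... Memory 7605616128.0
Epoch: [0][400/1000] ... Memory 9965116928.0
"""

# Captures the iteration index and the reported memory value on one line.
PATTERN = re.compile(r"Epoch: \[\d+\]\[(\d+)/\d+\].*Memory (\d+(?:\.\d+)?)")


def parse_memory(log_text):
    """Return a list of (iteration, memory) pairs found in the log text."""
    return [(int(m.group(1)), float(m.group(2))) for m in PATTERN.finditer(log_text)]


def growth_per_iteration(samples):
    """Return the average memory growth per iteration between consecutive samples."""
    rates = []
    for (i0, m0), (i1, m1) in zip(samples, samples[1:]):
        rates.append((m1 - m0) / (i1 - i0))
    return rates


samples = parse_memory(LOG_EXCERPT)
# Assumption: the values are bytes, so dividing by 2**20 gives MiB.
rates_mib = [rate / 2**20 for rate in growth_per_iteration(samples)]

for (start, _), (end, _), rate in zip(samples, samples[1:], rates_mib):
    print(f"iterations {start}-{end}: {rate:.2f} MiB per iteration")
```

If the bytes assumption holds, the reported memory grows by roughly 22.5 MiB on every iteration from iteration 100 onward, with the batch size fixed at 1. Steady growth of this kind would suggest that something is being retained across iterations, in which case lowering the batch size further would not be expected to help. The log alone does not show what is being retained.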

912267428 · 2023-07-15T02:37:55Z

Same problem here. Have you found a solution?

AlvinYH · 2023-07-23T16:00:04Z

Thanks for your interest in our work. We've modified the code, so please pull the latest release and try retraining the model. We ran our experiments on the Campus dataset with a batch size of 8, and it worked fine.
