Gets stuck during training initialization #26

Open
guragamb opened this issue Jun 12, 2021 · 1 comment

guragamb commented Jun 12, 2021

Hi! I was trying to get the repo working, but it gets stuck right before training begins (all the files are in the correct directories, and the Python packages match the versions listed in requirements.txt exactly).
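For reference, this is a hypothetical helper (not from the repo) that I could use to double-check that the installed package versions really match requirements.txt:

# Hypothetical check, not repository code: compare installed versions
# against pinned "name==version" lines in requirements.txt.
from importlib.metadata import version, PackageNotFoundError

with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip comments and unpinned entries
        name, wanted = line.split("==", 1)
        wanted = wanted.split(";")[0].strip()  # drop environment markers, if any
        try:
            got = version(name)
        except PackageNotFoundError:
            got = "missing"
        print(f"{name}: wanted {wanted}, got {got} "
              f"[{'OK' if got == wanted else 'MISMATCH'}]")

The console output I get is: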

Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to /host-machine/semKITTI/lidar-bonnetal/logging/ for further reference.
Sequences folder exists! Using sequences from /host-machine/semKITTI/Data/SemanticKitti/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 19130 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from /host-machine/semKITTI/Data/SemanticKitti/sequences
parsing seq 08
Using 4071 scans from sequences [8]
Loss weights from content:  tensor([  0.0000,  22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
        887.2239, 963.8915,   5.0051,  63.6247,   6.9002, 203.8796,   7.4802,
         13.6315,   3.7339, 142.1462,  12.6355, 259.3699, 618.9667])
Using SqueezeNet Backbone
Depth of backbone input =  5
Original OS:  16
New OS:  16
Strides:  [2, 2, 2, 2]
Decoder original OS:  16
Decoder new OS:  16
Decoder strides:  [2, 2, 2, 2]
Using CRF!
Total number of parameters:  928889
Total number of parameters requires_grad:  928884
Param encoder  735676
Param decoder  181248
Param head  11540
Param CRF  425
No path to pretrained, using random init.
Training in device:  cuda
Let's use 2 GPUs!
Ignoring class  0  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([0])
[IOU EVAL] INCLUDE:  tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19])

It gets stuck after the last line and doesn't do anything (nothing gets written to the logs either). I know you mentioned that you referenced the RangeNet++ project for the development of SqueezeSegV3, and they have a similar issue where training gets stuck (I was able to reproduce the error on their repo as well): PRBonn/lidar-bonnetal#39.
I would really appreciate any thoughts on why this might be happening!
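In case it helps, here is a minimal sanity check I could try. It is only a sketch, assuming the parser wraps a standard PyTorch Dataset that yields tuples of tensors; hangs at this point are sometimes related to multi-process data loading, though that is just a guess on my part:

# Hypothetical smoke test (not repository code): pull one batch through a
# single-process DataLoader. If data loading is the culprit, this either
# hangs as well or surfaces the underlying error instead of stalling silently.
import torch
from torch.utils.data import DataLoader

def smoke_test(dataset):
    loader = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)
    batch = next(iter(loader))  # should return within a few seconds
    shapes = [t.shape for t in batch if torch.is_tensor(t)]
    print("Got first batch, tensor shapes:", shapes)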

chenfengxu714 (Owner) commented

Hi, thanks for letting me know. It seems that some dependencies now have conflicts. I have updated the requirements file and some of the code, and tested on a new machine. Everything seems to work now.

Meanwhile, this code needs a lot of GPU memory. I train V321 with a mini-batch of 2 on each GPU (24 GB) and V351 with a mini-batch of 1 on each GPU (24 GB). The whole network is trained on 8 GPUs with SyncBN. If you don't have much GPU memory, I highly recommend using a smaller width setting, e.g., 1024 or 512, and training from my pretrained models instead of from scratch. We find that this training method comes close to training from scratch at 2048 width.
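As a rough illustration of why the width setting matters (back-of-the-envelope arithmetic only, not code from this repository, assuming the usual 64 x W range-image projection with the 5 input channels shown in the log above):

# Rough sketch: the projected LiDAR input is a (batch, channels, H, W) tensor,
# so activation memory scales roughly linearly with the width W.
def input_mib(batch, width, height=64, channels=5, bytes_per_elem=4):
    """Approximate size of one float32 input batch in MiB."""
    return batch * channels * height * width * bytes_per_elem / 2**20

for w in (2048, 1024, 512):
    print(f"width={w}: ~{input_mib(batch=2, width=w):.1f} MiB per input batch "
          "(intermediate feature maps scale similarly)")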
