Gets stuck during training initialization #26

Open
guragamb opened this issue Jun 12, 2021 · 1 comment

guragamb commented Jun 12, 2021

Hi! I was trying to get the repo working, but it gets stuck right before training begins (all the files are in the correct directories, and the Python packages match the versions listed in requirements.txt exactly).
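For reference, this is a hypothetical helper (not from the repo) that I could use to double-check that the installed package versions really match requirements.txt:

# Hypothetical check, not repository code: compare installed versions
# against pinned "name==version" lines in requirements.txt.
from importlib.metadata import version, PackageNotFoundError

with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip comments and unpinned entries
        name, wanted = line.split("==", 1)
        wanted = wanted.split(";")[0].strip()  # drop environment markers, if any
        try:
            got = version(name)
        except PackageNotFoundError:
            got = "missing"
        print(f"{name}: wanted {wanted}, got {got} "
              f"[{'OK' if got == wanted else 'MISMATCH'}]")

The console output I get is: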

Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to /host-machine/semKITTI/lidar-bonnetal/logging/ for further reference.
Sequences folder exists! Using sequences from /host-machine/semKITTI/Data/SemanticKitti/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 19130 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from /host-machine/semKITTI/Data/SemanticKitti/sequences
parsing seq 08
Using 4071 scans from sequences [8]
Loss weights from content:  tensor([  0.0000,  22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
        887.2239, 963.8915,   5.0051,  63.6247,   6.9002, 203.8796,   7.4802,
         13.6315,   3.7339, 142.1462,  12.6355, 259.3699, 618.9667])
Using SqueezeNet Backbone
Depth of backbone input =  5
Original OS:  16
New OS:  16
Strides:  [2, 2, 2, 2]
Decoder original OS:  16
Decoder new OS:  16
Decoder strides:  [2, 2, 2, 2]
Using CRF!
Total number of parameters:  928889
Total number of parameters requires_grad:  928884
Param encoder  735676
Param decoder  181248
Param head  11540
Param CRF  425
No path to pretrained, using random init.
Training in device:  cuda
Let's use 2 GPUs!
Ignoring class  0  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([0])
[IOU EVAL] INCLUDE:  tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19])

It gets stuck after the last line and doesn't do anything (nothing gets written to the logs either). I know you mentioned that you referenced the RangeNet++ project for the development of SqueezeSegV3, and they have a similar issue where training gets stuck (I was able to reproduce the error on their repo as well): PRBonn/lidar-bonnetal#39.
I would really appreciate any thoughts on why this might be happening!
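In case it helps, here is a minimal sanity check I could try. It is only a sketch, assuming the parser wraps a standard PyTorch Dataset that yields tuples of tensors; hangs at this point are sometimes related to multi-process data loading, though that is just a guess on my part:

# Hypothetical smoke test (not repository code): pull one batch through a
# single-process DataLoader. If data loading is the culprit, this either
# hangs as well or surfaces the underlying error instead of stalling silently.
import torch
from torch.utils.data import DataLoader

def smoke_test(dataset):
    loader = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)
    batch = next(iter(loader))  # should return within a few seconds
    shapes = [t.shape for t in batch if torch.is_tensor(t)]
    print("Got first batch, tensor shapes:", shapes)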

chenfengxu714 (Owner) commented

Hi, thanks for letting me know. It seems that some dependencies now have conflicts. I have updated the requirements file and some of the code, and tested on a new machine. Everything seems to work now.

Meanwhile, this code needs a lot of GPU memory. I train V321 with a mini-batch of 2 on each GPU (24 GB) and V351 with a mini-batch of 1 on each GPU (24 GB). The whole network is trained on 8 GPUs with SyncBN. If you don't have much GPU memory, I highly recommend using a smaller width setting, e.g., 1024 or 512, and training from my pretrained models instead of from scratch. We find that this training method comes close to training from scratch at 2048 width.
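As a rough illustration of why the width setting matters (back-of-the-envelope arithmetic only, not code from this repository, assuming the usual 64 x W range-image projection with the 5 input channels shown in the log above):

# Rough sketch: the projected LiDAR input is a (batch, channels, H, W) tensor,
# so activation memory scales roughly linearly with the width W.
def input_mib(batch, width, height=64, channels=5, bytes_per_elem=4):
    """Approximate size of one float32 input batch in MiB."""
    return batch * channels * height * width * bytes_per_elem / 2**20

for w in (2048, 1024, 512):
    print(f"width={w}: ~{input_mib(batch=2, width=w):.1f} MiB per input batch "
          "(intermediate feature maps scale similarly)")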
