"bad substitution" error #10
Comments
Try changing the command to: sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
Hello, I tried your suggestion, but I still get the following error:
This is probably a problem with your sh version. You can use [...] or, alternatively, use [...]
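The commands suggested in the reply above did not survive on this page, so the following is only a general diagnostic, not the author's fix. It is a minimal sketch, assuming a Linux system where `readlink -f` is available, that shows which shell `sh` actually resolves to. On some distributions `sh` is not bash, and bash-only syntax in a script can then fail.

```shell
# Find where `sh` lives and what it ultimately points to.
sh_path="$(command -v sh)"
sh_target="$(readlink -f "$sh_path")"   # follows symlinks to the real binary

echo "sh resolves to: $sh_target"
```

If the target is not bash, running the script with `bash` explicitly is one thing to try.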
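Thank you, the sh version problem is solved. The next run fails with a multi-GPU distributed training error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 24696). I would appreciate your advice. My first guess is that the --local_rank=0 argument, which selects the GPU, is the cause. I initially thought the selected GPU was already in use and tried changing --local_rank, but that had no effect. The error is as follows: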
Failures:
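Hello, the distributed training problem I mentioned above is mostly solved: replacing python torch.distributed.launch in dist_train.sh with torchrun was enough. The main error now is "The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use)"; details are below. I tried specifying different GPUs, so far without effect. I would appreciate your thoughts, thanks.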
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your appl
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:25641 (errno: 98 - Address already in use).
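For readers following along, the launcher swap described above might look roughly like the sketch below. The contents of dist_train.sh are not shown in this thread, so the script path `tools/train.py` and the variable names are hypothetical. The commands are only printed, not executed.

```shell
# Hypothetical values; the real dist_train.sh may use different names.
nproc=1
train_script='tools/train.py'   # assumed entry point, not confirmed by the thread

# Older launcher: passes a local-rank argument to the training script.
old_cmd="python -m torch.distributed.launch --nproc_per_node=${nproc} ${train_script}"
# Newer launcher: exports LOCAL_RANK in the environment instead.
new_cmd="torchrun --nproc_per_node=${nproc} ${train_script}"

echo "before: $old_cmd"
echo "after:  $new_cmd"

# A worker started by torchrun can read its rank from the environment
# (falls back to 0 when the variable is not set).
local_rank="${LOCAL_RANK:-0}"
echo "local rank: $local_rank"
```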
My guess is that changing --nproc_per_node=2 in your latest command to --nproc_per_node=1 should fix it. Your earlier run had a port conflict; use --master_port=25641 to change the port.
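When a port is reported as "Address already in use", it can help to look for a free one before launching. The following is a minimal sketch, assuming bash (it relies on bash's `/dev/tcp` pseudo-device, which plain sh does not provide). It only checks for a listener on 127.0.0.1, so treat the result as a candidate rather than a guarantee. The search bound of 100 ports is arbitrary.

```shell
# Requires bash: /dev/tcp is a bash feature, not a real device file.
port_in_use() {
  # The connect succeeds only if something is listening on 127.0.0.1:$1.
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

port=29500              # default master port used by the PyTorch launchers
limit=$((port + 100))   # arbitrary search bound for this sketch

# Step upward until a port without a listener is found or the bound is hit.
while port_in_use "$port" && [ "$port" -lt "$limit" ]; do
  port=$((port + 1))
done

echo "candidate master port: $port"
```

The resulting value could then be passed to the launcher as `--master_port=$port`; how dist_train.sh forwards that flag is not shown in this thread.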
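Hello, I had already tried what you suggested. I just ran it again and also tried reducing the batch size, but it still times out. Could this be caused by the GPU being in use by another process?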
Failures:
~/WCL/KD/DIST_KD-main/classification$ sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml ${cifar_resnet20} --teacher-model ${cifar_resnet56} --experiment ${checkpoint} --teacher-ckpt ${'./ckpt/ckpt_epoch_240.pth'}
bash: ${'./ckpt/ckpt_epoch_240.pth'}: bad substitution
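Hello, when reproducing the CIFAR results I had already downloaded the ckpt file and specified its path, but I get the "bad substitution" error shown above. Could you tell me how to fix it? Thanks!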
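For context on the error itself: in the shell, `${...}` is parameter expansion, so the braces must contain a variable name. A quoted literal such as `${'./ckpt/ckpt_epoch_240.pth'}` is not a valid name, hence "bad substitution", while `${cifar_resnet20}` silently expands to nothing if no such variable is set. The sketch below shows one way to keep the `${...}` style by assigning the values to variables first. The variable names are made up for illustration, and the command is printed rather than run.

```shell
# ${name} expands the variable called name, so define the variables first.
# These variable names are hypothetical, chosen only for this example.
student_model='cifar_resnet20'
teacher_model='cifar_resnet56'
experiment='checkpoint'
teacher_ckpt='./ckpt/ckpt_epoch_240.pth'

# Build the command line; printed here instead of executed.
cmd="sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml ${student_model} --teacher-model ${teacher_model} --experiment ${experiment} --teacher-ckpt ${teacher_ckpt}"
echo "$cmd"
```

Writing the literal values directly on the command line, without any `${...}`, gives the same result.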