Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
master branch https://github.com/open-mmlab/mmrotate
Environment
sys.platform: linux
Python: 3.8.20 (default, Oct 3 2024, 15:24:27) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Cuda compilation tools, release 11.1, V11.1.74
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.2+cu111
OpenCV: 4.11.0
MMEngine: 0.10.6
MMRotate: 1.0.0rc1+
Reproduces the problem - code sample
Reproduces the problem - command or script
Reproduces the problem - error message
unexpected key in source state_dict: student.roi_head.bbox_head.fc_cls.weight, student.roi_head.bbox_head.fc_cls.bias, teacher.roi_head.bbox_head.fc_cls.weight, teacher.roi_head.bbox_head.fc_cls.bias
missing keys in source state_dict: words, student.roi_head.bbox_head.fc_cls.words, student.roi_head.bbox_head.fc_cls.bg, student.roi_head.bbox_head.fc_cls.logit_scale, student.roi_head.bbox_head.fc_cls.fc_proj.weight, student.roi_head.bbox_head.fc_cls.fc_proj.bias, teacher.roi_head.bbox_head.fc_cls.words, teacher.roi_head.bbox_head.fc_cls.bg, teacher.roi_head.bbox_head.fc_cls.logit_scale, teacher.roi_head.bbox_head.fc_cls.fc_proj.weight, teacher.roi_head.bbox_head.fc_cls.fc_proj.bias
02/14 16:00:43 - mmengine - INFO - Load checkpoint from work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/merged_castdet_init_iter20k.pth
02/14 16:00:43 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
02/14 16:00:43 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
02/14 16:00:43 - mmengine - INFO - Checkpoints will be saved to /data1/amax/CastDet-main_obb/work_dirs/visdrone_step2_castdet_12b_10k_oriented.
start EMA update at 0
Traceback (most recent call last):
File "./tools/train.py", line 125, in
main()
File "./tools/train.py", line 121, in main
runner.train()
File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/runner/loops.py", line 289, in run
self.run_iter(data_batch)
File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/runner/loops.py", line 313, in run_iter
outputs = self.runner.model.train_step(
File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Successfully save queue in work_dirs/visdrone_step2_castdet_12b_10k_oriented/2_save_queue_samples.npz
start EMA update at 0
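
For reference, the RuntimeError above suggests enabling unused-parameter detection. A minimal sketch of how that could be set in an MMEngine-style Python config (this assumes the step-3 config does not define its own model_wrapper_cfg, so the Runner builds MMDistributedDataParallel itself and reads the top-level flag; the snippet is not taken from the CastDet configs):

```python
# Sketch only: enable unused-parameter detection for DDP training.
# Assumes MMEngine's Runner wraps the model itself (no custom model_wrapper_cfg),
# in which case it reads this top-level flag from the config file.
find_unused_parameters = True

# Optionally, TORCH_DISTRIBUTED_DEBUG can be exported before launching training
# (e.g. TORCH_DISTRIBUTED_DEBUG=DETAIL ./tools/dist_train.sh ...) to list the
# exact parameters that received no gradient on each rank.
```

This only makes DDP tolerate unused parameters; given the missing/unexpected fc_cls keys reported when loading the checkpoint, the merged weights themselves may still be the underlying problem.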
Additional information
Hello author. I tried to run Oriented CastDet and found a problem in merge_weights.py. The base model is oriented-rcnn, but merge_weights.py only offers two alternatives, 'soft-teacher' and 'fast-rcnn'. No error appeared when I merged the weights, but step 3 then fails with the error pasted above. I would greatly appreciate your reply.
Hello, I ran into a problem while reproducing Oriented CastDet. The merge_weights.py used in step 2 (merge_weights) does not offer oriented-rcnn as a base model; I tried merging the weights with both soft-teacher and faster-rcnn, and in both cases step 3 (self-training) fails with the error above. I suspect the weight-merging step went wrong and caused the later training failure. The error message is pasted above; I hope you can help me sort this out!
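
For completeness, here is a small, hypothetical diagnostic snippet (not part of the repo) that checks whether the merged checkpoint still contains the plain fc_cls head instead of the text-embedding classifier keys the model expects; the checkpoint path and key names are taken from the log above, everything else is an assumption:

```python
# Hypothetical diagnostic, not part of CastDet: inspect the merged checkpoint
# to see which classifier-head keys it actually contains.
import torch

ckpt_path = ('work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/'
             'merged_castdet_init_iter20k.pth')
ckpt = torch.load(ckpt_path, map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)

# Keys the load step reported as unexpected (plain linear classifier head).
plain_fc_cls = [k for k in state_dict
                if k.endswith(('fc_cls.weight', 'fc_cls.bias'))]

# Keys the load step reported as missing (text-embedding classifier head).
text_cls_suffixes = ('fc_cls.words', 'fc_cls.bg', 'fc_cls.logit_scale',
                     'fc_cls.fc_proj.weight', 'fc_cls.fc_proj.bias')
text_cls = [k for k in state_dict if k.endswith(text_cls_suffixes)]

print('plain fc_cls keys still present:', plain_fc_cls)
print('text-classifier keys present:', text_cls)
```

If the plain fc_cls keys are still there and the text-classifier keys are absent, that would be consistent with merge_weights.py (using the soft-teacher/fast-rcnn options) not producing the head layout the oriented CastDet config expects.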