Oriented CastDet merge-weights.py #13

Open
3 tasks done
NanaRobert opened this issue Feb 14, 2025 · 0 comments

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmrotate

Environment

sys.platform: linux
Python: 3.8.20 (default, Oct 3 2024, 15:24:27) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Cuda compilation tools, release 11.1, V11.1.74
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.2+cu111
OpenCV: 4.11.0
MMEngine: 0.10.6
MMRotate: 1.0.0rc1+

Reproduces the problem - code sample

  1. tools/dist_train.sh oriented-rcnn_r50-fpn_20k_visdronezsd_base-set.py
  2. python projects/CastDetv2/tools/merge_weights.py --clip_path checkpoints/RemoteCLIP-RN50.pt --base_path work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/iter_20000.pth --save_path work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/merged_castdet_init_iter20k.pth
  3. ./tools/dist_train.sh projects/CastDetv2/configs/visdrone_step2_castdet_12b_10k_oriented.py 2

Reproduces the problem - command or script

  1. python projects/CastDetv2/tools/merge_weights.py --clip_path checkpoints/RemoteCLIP-RN50.pt --base_path work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/iter_20000.pth --save_path work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/merged_castdet_init_iter20k.pth
  2. ./tools/dist_train.sh projects/CastDetv2/configs/visdrone_step2_castdet_12b_10k_oriented.py 2

Reproduces the problem - error message

unexpected key in source state_dict: student.roi_head.bbox_head.fc_cls.weight, student.roi_head.bbox_head.fc_cls.bias, teacher.roi_head.bbox_head.fc_cls.weight, teacher.roi_head.bbox_head.fc_cls.bias

missing keys in source state_dict: words, student.roi_head.bbox_head.fc_cls.words, student.roi_head.bbox_head.fc_cls.bg, student.roi_head.bbox_head.fc_cls.logit_scale, student.roi_head.bbox_head.fc_cls.fc_proj.weight, student.roi_head.bbox_head.fc_cls.fc_proj.bias, teacher.roi_head.bbox_head.fc_cls.words, teacher.roi_head.bbox_head.fc_cls.bg, teacher.roi_head.bbox_head.fc_cls.logit_scale, teacher.roi_head.bbox_head.fc_cls.fc_proj.weight, teacher.roi_head.bbox_head.fc_cls.fc_proj.bias
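
Read together, the two messages above suggest that the merged checkpoint still carries a plain `fc_cls.weight` / `fc_cls.bias` classifier, while the model being built expects `words`, `bg`, `logit_scale` and `fc_proj.*` under `fc_cls`. A minimal sketch of how such a comparison can be reproduced offline follows; the helper name `diff_state_dict_keys` is hypothetical, and the key lists are copied from the log (student branch only, for brevity).

```python
def diff_state_dict_keys(checkpoint_keys, model_keys):
    """Return (unexpected, missing) as sorted lists.

    unexpected: keys present in the checkpoint but not expected by the model
    missing:    keys expected by the model but absent from the checkpoint
    """
    checkpoint_keys = set(checkpoint_keys)
    model_keys = set(model_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    missing = sorted(model_keys - checkpoint_keys)
    return unexpected, missing


# Key names copied from the log above (student branch only).
checkpoint_keys = [
    "student.roi_head.bbox_head.fc_cls.weight",
    "student.roi_head.bbox_head.fc_cls.bias",
]
model_keys = [
    "student.roi_head.bbox_head.fc_cls.words",
    "student.roi_head.bbox_head.fc_cls.bg",
    "student.roi_head.bbox_head.fc_cls.logit_scale",
    "student.roi_head.bbox_head.fc_cls.fc_proj.weight",
    "student.roi_head.bbox_head.fc_cls.fc_proj.bias",
]

unexpected, missing = diff_state_dict_keys(checkpoint_keys, model_keys)
print("unexpected:", unexpected)
print("missing:", missing)
```

With a real checkpoint, the first argument would presumably come from the keys of the loaded state dict and the second from `model.state_dict().keys()`; that wiring is an assumption and is not shown here.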

02/14 16:00:43 - mmengine - INFO - Load checkpoint from work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/merged_castdet_init_iter20k.pth
02/14 16:00:43 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
02/14 16:00:43 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
02/14 16:00:43 - mmengine - INFO - Checkpoints will be saved to /data1/amax/CastDet-main_obb/work_dirs/visdrone_step2_castdet_12b_10k_oriented.
start EMA update at 0
Traceback (most recent call last):
  File "./tools/train.py", line 125, in <module>
    main()
  File "./tools/train.py", line 121, in main
    runner.train()
  File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/runner/loops.py", line 289, in run
    self.run_iter(data_batch)
  File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/runner/loops.py", line 313, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    results = self(**data, mode=mode)
  File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/amax/anaconda3/envs/ovad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Successfully save queue in work_dirs/visdrone_step2_castdet_12b_10k_oriented/2_save_queue_samples.npz
start EMA update at 0
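
The RuntimeError above names two diagnostics itself: passing `find_unused_parameters=True` to DistributedDataParallel, and setting `TORCH_DISTRIBUTED_DEBUG`. As a hedged sketch, assuming the MMEngine runner reads a top-level `find_unused_parameters` key from the config when it wraps the model (worth checking against the installed MMEngine version), the step-3 config could be extended like this:

```python
# Hypothetical addition to the step-3 config file (MMEngine configs are Python).
# Assumption: the runner forwards this top-level key to the DDP wrapper.
# This only lets DDP tolerate parameters that receive no gradient; it does not
# explain why they are unused.
find_unused_parameters = True
```

Setting the environment variable `TORCH_DISTRIBUTED_DEBUG=DETAIL` for the training command, as the error text suggests, should print the names of the parameters behind indices 158 onward. If the unused parameters turn out to be related to the key mismatch reported at load time, this flag may hide the symptom without fixing the cause.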

Additional information

Hello author. I tried to run Oriented CastDet and ran into a problem with merge_weights.py. The base model is oriented-rcnn, but merge_weights.py offers only two base model options: 'soft-teacher' and 'faster-rcnn'. No error appeared when I merged the weights, but step 3 fails with the error pasted above. I would greatly appreciate a reply.

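Hello, I ran into a problem while reproducing Oriented CastDet. In the second step (merge_weights), the base model options in merge_weights.py do not include oriented-rcnn. I tried merging the weights with both soft-teacher and faster-rcnn, and in both cases the third step (self-training) fails. I suspect that the weight merging itself went wrong and caused the later training failure. The error message is pasted above; I hope you can help me resolve this.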
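
To make the question concrete, here is a purely hypothetical sketch of what an oriented-rcnn merge option might need to do. It is not the project's actual merge_weights.py logic, and it ignores the CLIP weights entirely. The function name `merge_for_student_teacher` and the keys `backbone.conv1.weight` and `roi_head.bbox_head.fc_reg.weight` are placeholders; only the `student.` / `teacher.` prefixes and the `roi_head.bbox_head.fc_cls.` path are taken from the log above.

```python
def merge_for_student_teacher(base_state,
                              skip_substrings=("roi_head.bbox_head.fc_cls.",)):
    """Copy base detector weights under 'student.' and 'teacher.' prefixes.

    Keys containing any of skip_substrings are dropped, on the assumption
    that the old linear classifier is replaced by a different head in step 3.
    """
    merged = {}
    for key, value in base_state.items():
        if any(s in key for s in skip_substrings):
            continue  # drop the classifier weights that no longer match
        for prefix in ("student.", "teacher."):
            merged[prefix + key] = value
    return merged


# Placeholder state dict; strings stand in for tensors.
base_state = {
    "backbone.conv1.weight": "w0",
    "roi_head.bbox_head.fc_cls.weight": "w1",
    "roi_head.bbox_head.fc_reg.weight": "w2",
}
merged = merge_for_student_teacher(base_state)
print(sorted(merged))
```

Whether the existing 'soft-teacher' or 'faster-rcnn' branches already do something like this for an oriented-rcnn checkpoint is exactly what I am unsure about.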
