Problem description

While fine-tuning the HumanVLM model, I ran into the following error:
FileNotFoundError: can't find *_optim_states.pt files in directory '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'
According to the error, the optimizer state files appear to be missing, yet the model download directory contains no such files (e.g. *_optim_states.pt), so fine-tuning cannot start. My steps and a detailed description of the problem follow.
Steps to reproduce

1. Clone the HumanVLM project and install its dependencies.
2. Prepare the pretrained model and data:
- Model download path: OpenFace-CQUPT/Human_LLaVA
3. Run the fine-tuning command:
xtuner train HumanVLM/human_llama3_8b_instruct_siglip_so400m_large_p14_384_lora_e1_gpu8_finetune.py
4. The error above occurs.
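To confirm the directory really holds no optimizer shards, I checked with a small helper (my own diagnostic sketch, not part of xtuner; the temporary directory below stands in for the models--OpenFace-CQUPT--Human_LLaVA cache path):

```python
import tempfile
from pathlib import Path

def find_optim_states(ckpt_dir):
    """Glob for the DeepSpeed optimizer-state shards the loader expects."""
    return sorted(Path(ckpt_dir).glob("*_optim_states.pt"))

# Demo on a throwaway directory standing in for the HF cache directory:
# the model weights exist, but no *_optim_states.pt shards do.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "model.safetensors").touch()
    print(find_optim_states(d))  # -> [] : the same situation the error reports
```

Running the same glob against the real cache directory likewise returns an empty list.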
Additional information:
- Below are my changes in HumanVLM/HumanVLM/human_llama3_8b_instruct_siglip_so400m_large_p14_384_lora_e1_gpu8_finetune.py:
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'google/siglip-so400m-patch14-384'
# Specify the pretrained pth
#pretrained_pth = './work_dirs/human_llama3_8b_instruct_siglip_so400m_large_p14_384_e1_gpu8_pretrain/iter_54000.pth' # noqa: E501
pretrained_pth = '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'
# Data
#data_root = '/home/ubuntu/public-Datasets/HumanSFT/'
data_root = '/home/chou/deep/'
data_path = data_root + 'processed_from_converted_data_for_finetuning'
#data_path = data_root + 'ft_hfformat_base_attr_keypoint_0616_clean'
# data_path = data_root + 'ft_json_base_attr_keypoint_0616'
#image_folder = data_root + 'data'
image_folder = data_root + 'pt_images/train2014'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(4096 - 728)
- The full log is as follows:
01/12 22:37:29 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00, 1.85s/it]
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Processing zero checkpoint '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'
Traceback (most recent call last):
File "/home/chou/deep/HumanVLM/xtuner/xtuner/tools/train.py", line 364, in <module>
main()
File "/home/chou/deep/HumanVLM/xtuner/xtuner/tools/train.py", line 353, in main
runner = Runner.from_cfg(cfg)
File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/runner/runner.py", line 462, in from_cfg
runner = cls(
File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/runner/runner.py", line 429, in __init__
self.model = self.build_model(model)
File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/runner/runner.py", line 836, in build_model
model = MODELS.build(model)
File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/home/chou/deep/HumanVLM/xtuner/xtuner/model/llava.py", line 109, in __init__
pretrained_state_dict = guess_load_checkpoint(pretrained_pth)
File "/home/chou/deep/HumanVLM/xtuner/xtuner/model/utils.py", line 313, in guess_load_checkpoint
state_dict = get_state_dict_from_zero_checkpoint(
File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 617, in get_state_dict_from_zero_checkpoint
return _get_state_dict_from_zero_checkpoint(ds_checkpoint_dir,
File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 228, in _get_state_dict_from_zero_checkpoint
optim_files = get_optim_files(ds_checkpoint_dir)
File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 103, in get_optim_files
return get_checkpoint_files(checkpoint_dir, '*_optim_states.pt')
File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 96, in get_checkpoint_files
raise FileNotFoundError(
FileNotFoundError: can't find *_optim_states.pt files in directory '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'
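For context, the failing frames suggest that guess_load_checkpoint treats a directory path as a DeepSpeed ZeRO checkpoint and therefore requires *_optim_states.pt shards inside it. A simplified sketch of that branch (my reading of the traceback, not the actual xtuner source):

```python
import glob
import os

def get_checkpoint_files(checkpoint_dir, pattern):
    # Mirrors the failing frame: glob for shards, raise if none are found.
    files = sorted(glob.glob(os.path.join(checkpoint_dir, pattern)))
    if len(files) == 0:
        raise FileNotFoundError(
            f"can't find {pattern} files in directory '{checkpoint_dir}'")
    return files

def guess_load_checkpoint_sketch(pretrained_pth):
    # Simplified branch: a plain .pth file would go to torch.load, while a
    # directory is assumed to be a DeepSpeed ZeRO checkpoint that must
    # contain optimizer shards named *_optim_states.pt.
    if os.path.isfile(pretrained_pth):
        return "torch.load path"
    return get_checkpoint_files(pretrained_pth, "*_optim_states.pt")
```

Since my pretrained_pth points at the Hugging Face cache directory rather than a single .pth file, the loader takes the directory branch and fails exactly as shown above.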