How to pretrain the vision encoder? #124

@Chunchunwumu

Description

We need to continue unified structure learning on the DocOwl1.5-stage1 model with some private data, followed by LoRA fine-tuning for the Document Parsing task. Based on the original paper, we modified the parameters in the `finetune-docowl.sh` script, setting tune_vision2text=True, freeze_vision_model=False, and freeze_base_model=True to perform unified structure learning. After this stage, the model performed inference normally.

We then fine-tuned the model for the Document Parsing task with the `finetune-docowl_lora.sh` script, aiming to further improve its performance. During this fine-tuning, the loss decreased as expected, but after LoRA fine-tuning the model's inference outputs became garbled. In contrast, applying LoRA fine-tuning directly to the DocOwl1.5-stage1 model did achieve the desired results.
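For reference, the flag settings we changed for the unified-structure-learning stage look roughly like this (a sketch of the relevant arguments only; the remaining arguments and exact invocation follow the original `finetune-docowl.sh`, and the flag names are taken from that script):

```shell
# Stage: continued unified structure learning on DocOwl1.5-stage1
# Train the vision-to-text module and the vision encoder; keep the LLM frozen.
# Other arguments (data paths, batch size, etc.) omitted here.
  --tune_vision2text True \
  --freeze_vision_model False \
  --freeze_base_model True
```

The subsequent Document Parsing stage then uses `finetune-docowl_lora.sh` unchanged, starting from the checkpoint produced above.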

We would appreciate any suggestions you might have regarding our experimental design to help us achieve the expected results.
