Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization

This repository provides the official implementation of RRPO.

Installation

Clone the repository and navigate to the RRPO directory:

git clone https://github.com/pritamqu/RRPO
cd RRPO

This repository supports three Large Video Language Models (LVLMs), each with its own dependency requirements:

VideoChat2: videochat2.txt
LLaVA-Video: llava.txt
LongVU: longvu.txt

Example: Setting up LLaVA-Video

The commands below set up an environment for LLaVA-Video; follow similar steps for the other models.

conda create -n llava python=3.10 -y
conda activate llava
pip install -r llava.txt
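The same three steps can be generated for the other models. The sketch below is only illustrative: `setup_commands` is a hypothetical helper, the environment names are made up, and reusing Python 3.10 for VideoChat2 and LongVU is an assumption that this README does not confirm.

```python
def setup_commands(env_name, requirements_file, python_version="3.10"):
    """Return the shell commands that create a conda env and install dependencies."""
    return [
        f"conda create -n {env_name} python={python_version} -y",
        f"conda activate {env_name}",
        f"pip install -r {requirements_file}",
    ]


# Hypothetical environment names; the requirement files are the ones listed above.
for env, req in [("videochat2", "videochat2.txt"), ("longvu", "longvu.txt")]:
    print("\n".join(setup_commands(env, req)))
    print()
```

Keeping one environment per model avoids conflicts between their dependency sets.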

Models

The self-aligned LVLMs trained with the RRPO loss are released on Hugging Face. While these models were trained using LoRA, we also provide their merged weights to allow for direct use in evaluation and inference tools.

Model	LoRA	Merged Weights
VideoChat2_stage3_Mistral_7B-RRPO-16f	pritamqu/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA	-
LLaVA-Video-7B-Qwen2-RRPO-16f	pritamqu/LLaVA-Video-7B-Qwen2-RRPO-16f-LORA	pritamqu/LLaVA-Video-7B-Qwen2-RRPO-16f
LLaVA-Video-7B-Qwen2-RRPO-32f	pritamqu/LLaVA-Video-7B-Qwen2-RRPO-32f-LORA	pritamqu/LLaVA-Video-7B-Qwen2-RRPO-32f
LongVU_Qwen2_7B-RRPO-16f	pritamqu/LongVU_Qwen2_7B-RRPO-16f-LORA	pritamqu/LongVU_Qwen2_7B-RRPO-16f

You can download the weights with git, for example:

git clone git@hf.co:pritamqu/LLaVA-Video-7B-Qwen2-RRPO-32f-LORA
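The table and the clone command above can be combined into a small lookup. This is a minimal sketch: `RELEASES`, `repo_id`, and `clone_command` are hypothetical names, and the HTTPS form of the Hugging Face URL is used here as an assumption in place of the SSH form. It prefers the merged weights and falls back to the LoRA adapter when no merged release is listed.

```python
# Repository IDs copied from the table above; None means no merged release is listed.
RELEASES = {
    "VideoChat2_stage3_Mistral_7B-RRPO-16f": {
        "lora": "pritamqu/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA",
        "merged": None,
    },
    "LLaVA-Video-7B-Qwen2-RRPO-16f": {
        "lora": "pritamqu/LLaVA-Video-7B-Qwen2-RRPO-16f-LORA",
        "merged": "pritamqu/LLaVA-Video-7B-Qwen2-RRPO-16f",
    },
    "LLaVA-Video-7B-Qwen2-RRPO-32f": {
        "lora": "pritamqu/LLaVA-Video-7B-Qwen2-RRPO-32f-LORA",
        "merged": "pritamqu/LLaVA-Video-7B-Qwen2-RRPO-32f",
    },
    "LongVU_Qwen2_7B-RRPO-16f": {
        "lora": "pritamqu/LongVU_Qwen2_7B-RRPO-16f-LORA",
        "merged": "pritamqu/LongVU_Qwen2_7B-RRPO-16f",
    },
}

HF_HTTPS = "https://huggingface.co"  # assumption: clone over HTTPS instead of SSH


def repo_id(model, prefer_merged=True):
    """Pick the merged repository when it exists, otherwise the LoRA adapter."""
    entry = RELEASES[model]
    if prefer_merged and entry["merged"] is not None:
        return entry["merged"]
    return entry["lora"]


def clone_command(model, prefer_merged=True):
    """Build the git clone command for the chosen repository."""
    return f"git clone {HF_HTTPS}/{repo_id(model, prefer_merged)}"


for name in RELEASES:
    print(clone_command(name))
```

Weight files on Hugging Face are stored with Git LFS, so git-lfs needs to be installed for the clone to fetch the actual checkpoints.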

Dataset

Our training data is released as the Self-Alignment Dataset. It contains the preferred and non-preferred responses used in self-alignment training.

git clone git@hf.co:datasets/pritamqu/self-alignment

The related videos can be downloaded from their original sources. Please check the VideoChat2-IT GitHub page for details on downloading the source videos.

We also share additional details on how to use your own data here.

Training

Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as:

VideoChat2

bash scripts/videochat2/run.sh

LLaVA-Video

bash scripts/llavavideo/run.sh

LongVU

bash scripts/longvu/run.sh

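The base models are VideoChat2_stage3_Mistral_7B, LLaVA-Video-7B-Qwen2, and LongVU_Qwen2_7B; download their weights from the original releases.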

Inference

We provide a simple setup for running inference with our trained models.

VideoChat2

bash scripts/inference_videochat2.sh

LLaVA-Video

bash scripts/inference_llavavideo.sh

LongVU

bash scripts/inference_longvu.sh
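The training and inference scripts listed in this README can be wrapped in one small launcher. This is a sketch under assumptions: `SCRIPTS`, `build_command`, and `launch` are hypothetical names, the short model keys are made up for illustration, and only the script paths come from this README. It defaults to a dry run that prints the command instead of executing it.

```python
import subprocess

# Script paths copied from the Training and Inference sections.
SCRIPTS = {
    "train": {
        "videochat2": "scripts/videochat2/run.sh",
        "llavavideo": "scripts/llavavideo/run.sh",
        "longvu": "scripts/longvu/run.sh",
    },
    "inference": {
        "videochat2": "scripts/inference_videochat2.sh",
        "llavavideo": "scripts/inference_llavavideo.sh",
        "longvu": "scripts/inference_longvu.sh",
    },
}


def build_command(task, model):
    """Map a (task, model) pair to the bash command that runs its script."""
    try:
        script = SCRIPTS[task][model]
    except KeyError:
        raise ValueError(f"unknown task/model: {task}/{model}") from None
    return ["bash", script]


def launch(task, model, dry_run=True):
    """Print the command on a dry run, otherwise execute it from the repo root."""
    cmd = build_command(task, model)
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd, check=True).returncode


launch("train", "longvu")
launch("inference", "llavavideo")
```

Calling `launch(..., dry_run=False)` from the RRPO directory would run the corresponding script with whatever settings it already defines.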

Results

RRPO improves over the base models on nearly all benchmarks and matches or outperforms DPO on all but one (MVBench with VideoChat2).

Models	#Frames	TVBench	TempCompass	VideoHallucer	VidHalluc	MVBench	Video-MME	MLVU	LongVideoBench
VideoChat2	16	44.0	59.3	23.1	73.3	60.2	41.0	46.4	40.4
VideoChat2 + DPO	16	45.7	60.0	22.1	72.4	59.6	43.0	47.4	41.0
VideoChat2 + RRPO	16	45.8	60.2	32.9	76.4	59.0	44.3	47.9	42.8

LLaVA-Video	64	51.0	66.0	50.0	76.6	61.1	64.0	68.6	60.1
LLaVA-Video + DPO	64	51.9	66.4	53.3	76.5	60.6	63.1	67.4	59.4
LLaVA-Video + RRPO	64	51.9	66.8	55.7	76.5	62.2	64.5	69.1	60.4
LLaVA-Video + RRPO (32f)	64	52.2	67.4	55.8	76.6	62.1	64.5	69.4	60.1

LongVU	1fps	53.7	63.9	39.2	67.3	65.5	56.2	63.6	48.6
LongVU + DPO	1fps	54.3	64.3	40.9	68.5	65.9	56.6	63.6	49.4
LongVU + RRPO	1fps	56.5	64.5	44.0	71.7	66.8	57.7	64.5	49.7
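The table above can be summarized by averaging the per-benchmark differences between RRPO and each baseline. The snippet below is a minimal sketch that copies the scores from the table (the "RRPO (32f)" row is left out); `mean_gain` and `regressions` are hypothetical helper names, and an unweighted mean over the eight benchmarks is an assumption of this sketch rather than a metric reported by the authors.

```python
BENCHMARKS = ["TVBench", "TempCompass", "VideoHallucer", "VidHalluc",
              "MVBench", "Video-MME", "MLVU", "LongVideoBench"]

# Scores copied from the results table, in the column order of BENCHMARKS.
SCORES = {
    "VideoChat2": {
        "base": [44.0, 59.3, 23.1, 73.3, 60.2, 41.0, 46.4, 40.4],
        "dpo":  [45.7, 60.0, 22.1, 72.4, 59.6, 43.0, 47.4, 41.0],
        "rrpo": [45.8, 60.2, 32.9, 76.4, 59.0, 44.3, 47.9, 42.8],
    },
    "LLaVA-Video": {
        "base": [51.0, 66.0, 50.0, 76.6, 61.1, 64.0, 68.6, 60.1],
        "dpo":  [51.9, 66.4, 53.3, 76.5, 60.6, 63.1, 67.4, 59.4],
        "rrpo": [51.9, 66.8, 55.7, 76.5, 62.2, 64.5, 69.1, 60.4],
    },
    "LongVU": {
        "base": [53.7, 63.9, 39.2, 67.3, 65.5, 56.2, 63.6, 48.6],
        "dpo":  [54.3, 64.3, 40.9, 68.5, 65.9, 56.6, 63.6, 49.4],
        "rrpo": [56.5, 64.5, 44.0, 71.7, 66.8, 57.7, 64.5, 49.7],
    },
}


def mean_gain(model, baseline):
    """Unweighted average of (RRPO - baseline) over the eight benchmarks."""
    rows = SCORES[model]
    diffs = [r - b for r, b in zip(rows["rrpo"], rows[baseline])]
    return sum(diffs) / len(diffs)


def regressions(model, baseline):
    """Benchmarks where RRPO scores strictly below the given baseline."""
    rows = SCORES[model]
    return [name for name, r, b in zip(BENCHMARKS, rows["rrpo"], rows[baseline])
            if r < b]


for model in SCORES:
    for baseline in ("base", "dpo"):
        below = regressions(model, baseline) or "none"
        print(f"{model:12s} vs {baseline:4s}: "
              f"mean gain {mean_gain(model, baseline):+.2f}, below baseline on {below}")
```

Listing the regressions next to the mean makes it easy to see where an average gain hides a drop on an individual benchmark.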

Evaluation

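The evaluation benchmarks reported above (TVBench, TempCompass, VideoHallucer, VidHalluc, MVBench, Video-MME, MLVU, and LongVideoBench) can be downloaded from their official sources.

Next, you can run the full evaluation by following the evaluation instructions provided in this repository.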

Citation

If you find this work useful, please consider citing our paper:

@article{sarkar2025rrpo,
  title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
  author={Sarkar, Pritam and others},
  journal={arXiv preprint arXiv:2504.12083},
  year={2025}
}

Usage and License Notices

This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses. The assets used in this work include, but are not limited to: VideoChat2-IT, VideoChat2_stage3_Mistral_7B, LLaVA-Video-7B-Qwen2, LongVU_Qwen2_7B. This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations. This repository is released under the Apache 2.0 License. See LICENSE for details.

For any issues or questions, please open an issue or contact Pritam Sarkar at [email protected]!
