OneVision Encoder - LLaVA Next

This repository contains the LLaVA-Next implementation for OneVision Encoder models with codec-based video understanding.

Quick Start

1. 🐳 Docker (Recommended)

We strongly recommend using the Docker environment for a seamless experience. The following instructions are tailored to an A100 80GB GPU environment.

# Clone repository
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder
cd OneVision-Encoder/llava_next

# Build Docker image
docker build -t ov_encoder_llava:26.01 .

# Run container
docker run -it --gpus all \
    --ipc host --net host --privileged --cap-add IPC_LOCK \
    --ulimit memlock=-1 --ulimit stack=67108864 --rm \
    -v $(pwd):/workspace/OV-Encoder-Llava \
    -w /workspace/OV-Encoder-Llava \
    --name "ov_encoder_llava_container" \
    ov_encoder_llava:26.01 bash -c "service ssh restart; bash"

Training Data Preparation

Training data for codec mode requires precomputed visual assets (mosaic images + position indices). Each training sample contains:

  • Pre-extracted frame images (e.g., 8 frames per video)
  • Position indices file (positions_thw.npy) encoding temporal-height-width coordinates

Data Format

Original (Raw) Video Format

Raw video training data uses a JSON array with direct video paths:

[
  {
    "id": "YVQwAEKZpaU",
    "conversations": [
      {"from": "human", "value": "<video>\nWhat is the background setting?"},
      {"from": "gpt", "value": "A clear blue sky with spectators."}
    ],
    "video": "/path/to/videos/ytb_YVQwAEKZpaU.mp4"
  }
]

Converted Codec Format (JSONL)

Each line in the training JSONL should follow this format:

{
  "id": "sample_unique_id",
  "conversations": [
    {"from": "human", "value": "<image>\n<image>\n<image>\n<image>\n<image>\n<image>\n<image>\n<image>\nYour question here?"},
    {"from": "gpt", "value": "Model response here."}
  ],
  "image": [
    "/path/to/frame_000.jpg",
    "/path/to/frame_001.jpg",
    "/path/to/frame_002.jpg",
    "/path/to/frame_003.jpg",
    "/path/to/frame_004.jpg",
    "/path/to/frame_005.jpg",
    "/path/to/frame_006.jpg",
    "/path/to/frame_007.jpg"
  ],
  "positions_thw": "/path/to/positions_thw.npy"
}

Format Comparison

| Field | Raw Format | Codec Format |
| --- | --- | --- |
| Visual token | `<video>` (single) | `<image>` × N (one per frame) |
| Visual path | video: single mp4 path | image: list of frame paths |
| Position info | Not required | positions_thw: npy file path |
| File format | JSON array | JSONL (one sample per line) |

Key Fields

| Field | Description |
| --- | --- |
| id | Unique sample identifier |
| conversations | Multi-turn conversation in human/gpt format |
| image | List of frame image paths (8 frames for codec mode) |
| positions_thw | Path to a numpy file containing patch position indices |

Note: The number of <image> tokens in the conversation must match the number of images in the image list.
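
A minimal sanity check for this constraint is sketched below (the train.jsonl path is illustrative; field names follow the format above):

import json

IMAGE_TOKEN = "<image>"

def check_sample(sample):
    # Count <image> tokens across all conversation turns.
    num_tokens = sum(turn["value"].count(IMAGE_TOKEN) for turn in sample["conversations"])
    images = sample.get("image", [])
    num_images = len(images) if isinstance(images, list) else 1
    return num_tokens == num_images

# Illustrative path; point this at your converted training JSONL.
with open("train.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        sample = json.loads(line)
        if not check_sample(sample):
            print(f"line {line_no}: <image> token count does not match image list (id={sample.get('id')})")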

Position Indices Format (positions_thw.npy)

The positions_thw.npy file contains patch position coordinates:

| Property | Description |
| --- | --- |
| Shape | [num_patches, 3], where each row is [t, h, w] |
| Dtype | int32 |
| Coordinates | t: temporal index, h: height position, w: width position |

Example: For 8 frames with 36×36 patches each → shape [10368, 3]

import numpy as np
positions = np.load("positions_thw.npy")
# positions.shape = (10368, 3)
# positions[:5] = [[0,0,0], [0,0,1], [0,0,2], [0,0,3], [0,0,4]]
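
The row count can be derived from the conversion parameters. The sketch below (not part of the pipeline; default parameters assumed) reproduces the 10368 figure and spot-checks a loaded file:

import numpy as np

# Conversion parameters (defaults used throughout this README).
square_size = 576
patch_size = 16
num_images = 8

patches_per_image = (square_size // patch_size) ** 2  # 36 * 36 = 1296
expected_rows = num_images * patches_per_image        # 8 * 1296 = 10368

positions = np.load("positions_thw.npy")
assert positions.ndim == 2 and positions.shape[1] == 3, "expected shape [num_patches, 3]"
if positions.dtype != np.int32:
    print(f"warning: expected int32 coordinates, got {positions.dtype}")
print(f"rows: {positions.shape[0]} (expected {expected_rows})")
print(f"t range: {positions[:, 0].min()}..{positions[:, 0].max()}")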

Mixed Training Data

You can mix video (codec) and image data in the same JSONL. For image-only samples:

{
  "id": "image_sample_id",
  "conversations": [
    {"from": "human", "value": "<image>\nDescribe this image."},
    {"from": "gpt", "value": "Description here."}
  ],
  "image": "/path/to/single_image.jpg"
}

Image samples do not require the positions_thw field.

Directory Structure

training_data_root/
├── images/
│   └── shard00/
│       └── sample_<unique_key>/
│           ├── video_000.jpg
│           ├── video_001.jpg
│           ├── ...
│           ├── video_007.jpg
│           └── positions_thw.npy
└── train.jsonl

Training Data Conversion Pipeline

Convert raw video data to codec format using the two-stage pipeline:

Raw Video Data (JSON with <video> token)
        ↓
Stage 1: Extract codec info (MV/Residual energy) → visidx_thw.npy, frame_ids.npy
        ↓
Stage 2: Pack frames into 8 images → positions_thw.npy, training.jsonl (with <image> tokens)

Quick Start (Demo)

cd llava_next

# Run the complete pipeline with sample videos
bash examples/training_data_demo/run_training_data_pipeline.sh

Manual Execution

Stage 1: Extract codec information

python Compressed_Video_Reader/tool/stage1.py \
    --dataset_path /path/to/raw_videos.json \
    --out_root /path/to/stage1_output \
    --sequence_length 64 \
    --keep_frames_equiv 8 \
    --square_size 576 \
    --patch_size 16 \
    --num_workers 8 \
    --keep_first_full_frame \
    --padding_policy zero

Stage 2: Pack frames and generate training JSONL

python Compressed_Video_Reader/tool/stage2.py \
    --mode pack \
    --input_dataset /path/to/raw_videos.json \
    --out_jsonl /path/to/training_codec.jsonl \
    --visidx_root /path/to/stage1_output \
    --out_image_root /path/to/stage2_images \
    --num_images 8 \
    --square_size 576 \
    --T 64 \
    --patch 16 \
    --write_positions \
    --num_workers 8 \
    --first_full

Pipeline Parameters

| Parameter | Stage | Description |
| --- | --- | --- |
| --sequence_length / --T | 1 & 2 | Number of frames for codec analysis (default: 64) |
| --keep_frames_equiv / --num_images | 1 & 2 | Number of output images per video (default: 8) |
| --square_size | 1 & 2 | Image size (default: 576) |
| --patch_size / --patch | 1 & 2 | Patch size for position encoding (default: 16) |
| --keep_first_full_frame | 1 | Keep the first frame as a complete I-frame (recommended) |
| --padding_policy | 1 | How to handle empty patches: zero or repeat |
| --first_full | 2 | Stage 2 counterpart of --keep_first_full_frame; pass it when Stage 1 used that flag |
| --write_positions | 2 | Generate positions_thw.npy files |
| --num_workers | 1 & 2 | Number of parallel worker processes |

Output Structure

stage2_output/
├── sample_<id>__<video_stem>__<hash>/
│   ├── video_000.jpg ~ video_007.jpg
│   └── positions_thw.npy
└── ...
training_codec.jsonl
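
A small sketch for spot-checking the generated sample directories (the stage2_output path is illustrative; substitute the directory passed to --out_image_root):

from pathlib import Path

num_images = 8
out_root = Path("stage2_output")  # illustrative; use your --out_image_root

for sample_dir in sorted(p for p in out_root.iterdir() if p.is_dir()):
    jpgs = sorted(sample_dir.glob("video_*.jpg"))
    has_positions = (sample_dir / "positions_thw.npy").exists()
    if len(jpgs) != num_images or not has_positions:
        print(f"incomplete sample: {sample_dir.name} "
              f"({len(jpgs)} images, positions_thw={'present' if has_positions else 'missing'})")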

Evaluation

Preparing Offline Codec Assets for Evaluation

For video evaluations using codec mode, precompute visual assets before running evaluation.

Quick Start

cd llava_next

# Preprocess a single benchmark (auto-downloads data if needed)
bash scripts/precompute_codec_patch/preprocess_video_benchmark.sh videomme
bash scripts/precompute_codec_patch/preprocess_video_benchmark.sh mvbench
bash scripts/precompute_codec_patch/preprocess_video_benchmark.sh perceptiontest

# Or preprocess all supported benchmarks
bash scripts/precompute_codec_patch/preprocess_video_benchmark.sh all

Supported Benchmarks

| Task Name | lmms-eval Task | Description |
| --- | --- | --- |
| videomme | videomme | Video-MME benchmark |
| mvbench | mvbench | MVBench benchmark |
| perceptiontest | perceptiontest_val_mc | PerceptionTest Val |
| nextqa | nextqa_mc_test | NExTQA benchmark |
| temporalbench | temporalbench_long_qa | TemporalBench |
| video_mmmu | video_mmmu | Video-MMMU |
| tomato | tomato | TOMATO benchmark |
| longvideobench | longvideobench_val_v | LongVideoBench |

Prerequisites (Gated Datasets)

Some datasets require HuggingFace authentication:

# Login to Hugging Face (one-time setup)
huggingface-cli login

# Accept dataset terms on HuggingFace website if required

Output Structure

.huggingface_cache/
├── mvbench_video/              # lmms-eval video cache (auto-downloaded)
│   └── *.mp4
├── mvbench_offline/            # Precomputed offline assets
│   ├── mvbench_videos.jsonl
│   └── assets/
│       └── <video_stem>/
│           ├── mosaic_000.jpg ~ mosaic_007.jpg
│           ├── positions_thw.npy
│           └── meta.json

Running Evaluation

Using Local Script (Recommended)

The local eval script auto-detects offline assets based on the task:

bash scripts/eval/local_eval_ov_encoder.sh

Manual Environment Variables

export LLAVA_CODEC_USE_OFFLINE=1
export LLAVA_CODEC_OFFLINE_ROOT=$(pwd)/.huggingface_cache/<task>_offline/assets
export LLAVA_CODEC_VISIDX_MODE=pack_topk
export LLAVA_CODEC_SEQ_LEN_FRAMES=64
export LLAVA_CODEC_NUM_IMAGES=8
export LLAVA_CODEC_SQUARE_SIZE=576
export LLAVA_CODEC_PATCH_SIZE=16

bash scripts/eval/eval_ov_encoder.sh

Codec Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| SEQ_LEN_FRAMES | 64 | Number of frames for codec analysis |
| NUM_IMAGES | 8 | Number of output mosaic images per video |
| SQUARE_SIZE | 576 | Image size (576×576) |
| PATCH_SIZE | 16 | Patch size for position encoding |
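
With these defaults, each 576×576 mosaic is divided into (576 / 16)² = 1296 patches, so the 8 mosaics for a video yield 8 × 1296 = 10368 position entries, matching the positions_thw.npy example in the training data section.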

Troubleshooting

If evaluation shows MISS (fallback to frame extraction), check the following; a verification sketch follows the list:

  1. Check the offline root path: LLAVA_CODEC_OFFLINE_ROOT should point to the assets/ directory
  2. Check video key matching: the <video_stem> folder name must match the key the model expects
  3. Verify that the files exist: mosaic_000.jpg, positions_thw.npy, and meta.json should all be present
  4. Check codec parameters: ensure precompute and evaluation use the same parameters
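
The sketch below checks points 1–3 for a single video (the offline root and video stem are placeholders; adjust them to your setup):

from pathlib import Path

# Placeholders; set to your LLAVA_CODEC_OFFLINE_ROOT and the expected folder name.
offline_root = Path(".huggingface_cache/mvbench_offline/assets")
video_stem = "some_video_stem"

asset_dir = offline_root / video_stem
required = [f"mosaic_{i:03d}.jpg" for i in range(8)] + ["positions_thw.npy", "meta.json"]
missing = [name for name in required if not (asset_dir / name).exists()]

if not asset_dir.is_dir():
    print(f"MISS: asset directory not found: {asset_dir}")
elif missing:
    print(f"MISS: incomplete assets in {asset_dir}: {missing}")
else:
    print(f"OK: offline assets present for '{video_stem}'")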

Manual Preprocessing (Advanced)

For custom datasets or fine-grained control:

# 1. Prepare input JSONL with video paths and unique keys
# Each line: {"video": "/path/to/video.mp4", "key": "unique_id", ...}

# 2. Run offline precompute
python Compressed_Video_Reader/tool/offline_precompute_llava_codec_assets.py \
    --jsonl path/to/eval_videos.jsonl \
    --out_root path/to/offline_root \
    --num_workers 8 \
    --seq_len_frames 64 \
    --num_images 8 \
    --square_size 576 \
    --patch_size 16

# Optional: sharding for large datasets
python Compressed_Video_Reader/tool/offline_precompute_llava_codec_assets.py \
    --jsonl path/to/eval_videos.jsonl \
    --out_root path/to/offline_root \
    --num_shards 8 --shard_id 0

License

This project is licensed under the Apache 2.0 License.