FlowNar: Scalable Streaming Narration for Long-Form Videos

Zeyun Zhong*, Manuel Martin, Chengzhi Wu, David Schneider, Frederik Diederichs, Juergen Gall, Juergen Beyerer

Karlsruhe Institute of Technology (KIT) · Fraunhofer IOSB · Lamarr Institute · University of Bonn

🎬 Demo

FlowNar generates dense, temporally grounded narrations for a continuous video stream in real time. Unlike prior work, its VRAM usage stays bounded regardless of video length.

✨ Introduction

Our paper presents a scalable approach to streaming video narration. Its two main components are:

Dynamic Context Management (DCM). Prunes the visual KV cache after each narration segment, which prevents unbounded context growth and limits error propagation from potentially misaligned history. A streaming memory (CLAM) then iteratively compresses the visual information from processed segments into a fixed-size set of memory tokens, so memory usage and per-step compute for historical visual frames stay O(1) with respect to video length.
Self-conditioned evaluation protocol. An autoregressive pipeline where the model generates narrations conditioned on its own previous outputs, enabling deployment-like assessment beyond standard teacher-forcing evaluation.
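
The bookkeeping behind DCM can be illustrated with a toy sketch. This is not the FlowNar implementation: `StreamingContext`, its fields, and `compress` are hypothetical names, and `compress` uses plain chunk averaging as a stand-in for the learned CLAM module. It only shows why the context stays bounded: after every segment the per-frame cache is dropped and the history is folded into at most `num_memory_tokens` vectors.

```python
from dataclasses import dataclass, field


def compress(vectors, k):
    """Placeholder for a learned compressor: average k contiguous chunks."""
    n = len(vectors)
    if n <= k:
        return [list(v) for v in vectors]
    out = []
    for i in range(k):
        group = vectors[i * n // k:(i + 1) * n // k]  # never empty because n > k
        dim = len(group[0])
        out.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return out


@dataclass
class StreamingContext:
    num_memory_tokens: int = 20                        # mirrors --num_m_tokens 20
    memory: list = field(default_factory=list)         # fixed-size summary of past segments
    visual_cache: list = field(default_factory=list)   # frames of the current segment only

    def add_frame(self, feature):
        self.visual_cache.append(feature)

    def context_size(self):
        return len(self.memory) + len(self.visual_cache)

    def end_segment(self):
        # Fold the finished segment into memory, then prune the per-frame cache.
        self.memory = compress(self.memory + self.visual_cache, self.num_memory_tokens)
        self.visual_cache.clear()


if __name__ == "__main__":
    ctx = StreamingContext()
    for segment in range(100):            # an arbitrarily long stream
        for frame in range(50):           # hypothetical segment length
            ctx.add_frame([float(segment), float(frame)])
        ctx.end_segment()
    print(len(ctx.memory), len(ctx.visual_cache))
```

In this sketch the context never holds more than the memory tokens plus the frames of the segment currently being narrated, independent of how many segments have been processed.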

📂 Repository Structure

FlowNar/
├── configs/
│   └── deepspeed/           # DeepSpeed ZeRO configs
├── data/
│   ├── ego4d/               # Ego4D dataset loaders
│   ├── egoexo/              # EgoExo4D dataset loaders
│   └── ek100/               # EpicKitchens100 dataset loaders
├── engine/
│   ├── trainer_with_gen2eval.py     # Oracle teacher-forcing trainer
│   └── trainer_stream_generate.py  # Self-conditioned streaming trainer
├── models/                  # FlowNar model architecture (CLAM, DCM)
├── scripts/
│   ├── ego4d/               # Training & evaluation scripts — Ego4D
│   ├── egoexo4d/            # Training & evaluation scripts — EgoExo4D
│   └── ek100/               # Training & evaluation scripts — EpicKitchens100
├── train.py                 # Training entry point
├── stream_generate.py       # Self-conditioned generation & evaluation entry point
└── environment.yaml         # Conda environment specification

⚙️ Installation

Requirements: Python 3.10, CUDA 12.1

conda env create -f environment.yaml
conda activate flownar

Key dependencies: PyTorch 2.3.1 · Transformers 4.43.2 · DeepSpeed 0.15.4 · flash-attn 2.5.9

Note: The base LLM is meta-llama/Llama-3.2-1B-Instruct. You must accept the Llama 3.2 license on HuggingFace before downloading the weights.
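
After activating the environment, the installed versions can be compared against the key dependencies listed above. The following is a minimal sketch, not part of the repository: `find_mismatches` is a hypothetical helper, and the distribution names (`torch`, `flash-attn`, and so on) are assumed to match the names used in `environment.yaml`.

```python
from importlib import metadata

# Versions taken from the "Key dependencies" line; names are assumed pip names.
EXPECTED = {
    "torch": "2.3.1",
    "transformers": "4.43.2",
    "deepspeed": "0.15.4",
    "flash-attn": "2.5.9",
}


def find_mismatches(expected=EXPECTED, get_version=metadata.version):
    """Return {package: installed version or 'not installed'} for every mismatch."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Ignore local build tags such as "+cu121" when comparing.
        if found.split("+")[0] != wanted:
            problems[name] = found
    return problems


if __name__ == "__main__":
    for name, status in find_mismatches().items():
        print(f"{name}: expected {EXPECTED[name]}, found {status}")
```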

📦 Data Preparation

Download all dataset features and annotations from HuggingFace:

huggingface-cli download zeyun-zhong/FlowNar-Data \
    --repo-type dataset \
    --local-dir /path/to/FlowNar-Data

After downloading, the directory should have the following structure:

FlowNar-Data/
├── ek100/
│   ├── annotations/
│   ├── features/
│   └── features_metadata.json
├── ego4d/
│   ├── annotations/
│   ├── features/
│   └── features_metadata.json
└── egoexo/
    ├── annotations/
    ├── features/
    └── features_metadata.json
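
The layout above can be checked before launching a run. This is a small sketch under the assumption that every dataset directory contains exactly the three entries shown; `missing_entries` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

DATASETS = ("ek100", "ego4d", "egoexo")
EXPECTED_ENTRIES = ("annotations", "features", "features_metadata.json")


def missing_entries(data_root):
    """List expected paths (relative to data_root) that do not exist."""
    root = Path(data_root)
    return [
        (Path(dataset) / entry).as_posix()
        for dataset in DATASETS
        for entry in EXPECTED_ENTRIES
        if not (root / dataset / entry).exists()
    ]


if __name__ == "__main__":
    # Replace with the --local-dir used for the download.
    for path in missing_entries("/path/to/FlowNar-Data"):
        print("missing:", path)
```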

The feature data is distributed as multi-part tar archives. Reassemble and extract each split before use:
cat <archive_name>.tar.* | tar -xf -
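
Where a shell pipeline is inconvenient, the same reassembly can be sketched in Python. The code below is an illustration, not part of the repository: `ChainedReader` and `extract_multipart_tar` are hypothetical names, and the parts are assumed to sort correctly by file name, as they do for the shell glob above. The parts are streamed back to back, so no joined copy of the archive is written to disk.

```python
import glob
import tarfile
from pathlib import Path


class ChainedReader:
    """Expose several files, read back to back, as one binary stream."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._index = 0
        self._current = None

    def read(self, size=-1):
        chunks, remaining = [], size
        while size < 0 or remaining > 0:
            if self._current is None:
                if self._index >= len(self._paths):
                    break  # all parts consumed
                self._current = open(self._paths[self._index], "rb")
                self._index += 1
            data = self._current.read(remaining if size >= 0 else -1)
            if not data:  # this part is exhausted, move on to the next one
                self.close()
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_multipart_tar(pattern, dest):
    """Extract the archive formed by all files matching pattern into dest."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no archive parts match {pattern!r}")
    Path(dest).mkdir(parents=True, exist_ok=True)
    reader = ChainedReader(parts)
    try:
        # "r|*" reads a non-seekable stream and detects compression if present.
        with tarfile.open(fileobj=reader, mode="r|*") as tar:
            tar.extractall(dest)
    finally:
        reader.close()
    return parts
```

A call such as `extract_multipart_tar("<archive_name>.tar.*", "features")` would then correspond to the shell command above, run once per split.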

🤖 Pretrained Models

Model	Pretrained on	HuggingFace
FlowNar-1B	Ego4D	zeyun-zhong/flownar-1B-ego4d
FlowNar-1B	EgoExo4D	zeyun-zhong/flownar-1B-egoexo4d
FlowNar-1B	EpicKitchens100	zeyun-zhong/flownar-1B-ek100

These checkpoints can be passed directly to --resume_from_checkpoint in the commands below.

🚀 Training and Evaluation

Hardware

Training: 4× NVIDIA H100 (80 GB), DeepSpeed ZeRO-2
Evaluation: 4× NVIDIA H100

Quick Debug (No Real Data Required)

To verify the setup before running full experiments, enable local_debug mode — video features are replaced with random vectors:

deepspeed train.py \
    --live_version live1+ \
    --train_datasets ek100_refined_narration_stream_train \
    --llm_pretrained meta-llama/Llama-3.2-1B-Instruct \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --bf16 True \
    --tf32 True \
    --data_root /path/to/FlowNar-Data/ek100 \
    --local_debug True

Oracle Teacher-Forcing Protocol

Train on EpicKitchens100 (fine-tuning from an Ego4D pretrained checkpoint):

deepspeed train.py --deepspeed configs/deepspeed/zero2.json \
    --live_version live1+ \
    --train_datasets ek100_refined_narration_stream_train \
    --eval_datasets ek100_refined_narration_stream_val \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --learning_rate 0.0002 \
    --optim adamw_torch \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.05 \
    --bf16 True \
    --data_root /path/to/FlowNar-Data/ek100 \
    --resume_from_checkpoint zeyun-zhong/flownar-1B-ego4d \
    --llm_pretrained meta-llama/Llama-3.2-1B-Instruct \
    --vision_mask True \
    --enable_vision_memory True \
    --num_m_tokens 20 \
    --finetune_modules connector clustering \
    --finetune_downstream True \
    --output_dir outputs/ek100_flownar_1B

For Ego4D and EgoExo4D, refer to the corresponding scripts in scripts/ego4d/narration/ and scripts/egoexo4d/narration/, adjusting --data_root to the respective dataset path.
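
When adapting the command to different hardware, it helps to keep the effective batch size in mind. The arithmetic below is a sketch that assumes plain data parallelism across the 4 GPUs listed under Hardware. The sample count is a made-up placeholder, and the step and warmup counts are approximations, since the exact values depend on how the trainer rounds.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * num_gpus


def approximate_schedule(num_samples, epochs, warmup_ratio, batch_size):
    """Rough optimizer-step and warmup-step counts for a training run."""
    steps_per_epoch = max(num_samples // batch_size, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


if __name__ == "__main__":
    # Values from the command above: batch size 1, accumulation 16, 4 GPUs.
    batch = effective_batch_size(1, 16, 4)  # 1 * 16 * 4 = 64
    # 6400 training samples is a hypothetical figure used only for illustration.
    total, warmup = approximate_schedule(6400, epochs=4, warmup_ratio=0.05, batch_size=batch)
    print(batch, total, warmup)
```

Running on fewer GPUs with the same flags shrinks the effective batch size proportionally. Raising --gradient_accumulation_steps by the same factor is one way to compensate.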

Self-Conditioned Protocol

Run streaming generation and evaluation on EpicKitchens100 using a pretrained checkpoint:

deepspeed stream_generate.py \
    --live_version live1+ \
    --eval_datasets ek100_segment_summary_val \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy no \
    --bf16 True \
    --data_root /path/to/FlowNar-Data/ek100 \
    --resume_from_checkpoint zeyun-zhong/flownar-1B-ek100 \
    --llm_pretrained meta-llama/Llama-3.2-1B-Instruct \
    --vision_mask True \
    --enable_vision_memory True \
    --num_m_tokens 20 \
    --finetune_modules connector clustering \
    --output_dir outputs/flownar_1B_ek100

For other datasets, use the scripts in scripts/ego4d/stream_generate/ and scripts/egoexo4d/stream_generate/.
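
The two protocols differ only in which narration history the model is conditioned on. The toy loop below illustrates that difference. It does not reproduce the logic of stream_generate.py: `run_stream` is a hypothetical name and `model` is any callable that maps the current segment and the text history to a narration.

```python
def run_stream(model, segments, references=None):
    """Narrate segments in order.

    With references, the history holds ground-truth narrations (teacher forcing).
    Without references, the history holds the model's own previous outputs
    (self-conditioned), so earlier mistakes can influence later narrations.
    """
    history, outputs = [], []
    for i, segment in enumerate(segments):
        narration = model(segment, list(history))
        outputs.append(narration)
        history.append(references[i] if references is not None else narration)
    return outputs


if __name__ == "__main__":
    # A stand-in "model" that simply echoes the last history entry.
    def toy_model(segment, history):
        previous = history[-1] if history else "start"
        return f"{previous}+{segment}"

    print(run_stream(toy_model, ["a", "b", "c"]))                      # self-conditioned
    print(run_stream(toy_model, ["a", "b", "c"], ["R0", "R1", "R2"]))  # teacher forcing
```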

🙏 Acknowledgement

This work builds upon VideoLLM-Online. We thank the authors for their excellent open-source implementation.

📑 Citation

If you find this work useful, please consider citing:

@inproceedings{zhong2026flownar,
  title     = {{FlowNar}: Scalable Streaming Narration for Long-Form Videos},
  author    = {Zhong, Zeyun and Martin, Manuel and Wu, Chengzhi and Schneider, David and
               Diederichs, Frederik and Gall, Juergen and Beyerer, Juergen},
  booktitle = {International Conference on Machine Learning},
  year      = {2026},
  publisher = {PMLR},
  note      = {Accepted, to appear}
}

📄 License

This project is licensed under the MIT License — see the LICENSE file for details.
