
🌲 AKeyS: Agentic Keyframe Search for Video Question Answering

TODO: add arxiv url

News and Todo 🗓️

  • Release Code for Demo

  • Release Code for EgoSchema

  • Release Code for NExT-QA

Introduction

We present Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It effectively distinguishes key information from redundant, irrelevant content by leveraging modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent decides whether sufficient keyframes have been collected based on termination conditions and provides an answer. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe-searching efficiency, meaning it accurately identifies key information and performs effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy while processing only 43.5% of the frames compared to VideoTree.
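At its core, the search resembles an A*-style best-first expansion over the segment tree, with the language agent supplying the cost and heuristic estimates. The following is a minimal, illustrative sketch of that loop, not the repository's implementation; segment_video, llm_estimate_cost, llm_estimate_heuristic, llm_is_sufficient, and llm_answer are hypothetical placeholders standing in for the video splitter and the agent's prompts.

    import heapq
    import itertools

    def agentic_keyframe_search(video, question, max_frames=32):
        root = segment_video(video)               # organize the video as a tree of segments
        root.g = 0.0
        tie = itertools.count()                   # tie-breaker so heapq never compares nodes
        frontier = [(0.0, next(tie), root)]       # priority queue ordered by f = g + h
        keyframes = []
        while frontier:
            _, _, node = heapq.heappop(frontier)  # expand the most promising segment
            keyframes.extend(node.sampled_frames)
            # the agent checks a termination condition: has enough key evidence been collected?
            if llm_is_sufficient(question, keyframes) or len(keyframes) >= max_frames:
                break
            for child in node.children:           # finer-grained sub-segments of this node
                child.g = node.g + llm_estimate_cost(node, child, question)
                h = llm_estimate_heuristic(child, question)
                heapq.heappush(frontier, (child.g + h, next(tie), child))
        return llm_answer(question, keyframes)    # answer the question from the collected keyframes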

Installation Steps 🛠️

AKeyS does not require many computational resources; it can run on a personal computer without a GPU.

  1. Clone the repository 📦:

    git clone git@github.com:fansunqi/TreeVideoAgentPublic.git
    cd TreeVideoAgentPublic
  2. Create a virtual environment 🧹 and install the dependencies 🧑‍🍳:

    python3 -m venv tva_env
    source tva_env/bin/activate
    pip install -r requirements.txt
  3. Set up your API key 🗝️:

    Obtain an OpenAI API key and set OPENAI_API_KEY and OPENAI_BASE_URL as environment variables in ~/.zshrc or ~/.bashrc. In main.py, we use the following code to read the API key and base URL (a minimal client-setup sketch follows these steps):

    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL")
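
For reference, here is a minimal sketch of how these environment variables can be passed to the OpenAI Python client (assuming the openai >= 1.0 SDK; the repository's actual client setup in main.py may differ, and the model name is only an example):

    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.getenv("OPENAI_API_KEY"),    # read from the environment variable set above
        base_url=os.getenv("OPENAI_BASE_URL"),  # e.g. a proxy or regional endpoint
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # example model name, not necessarily the one used by AKeyS
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)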
    

QuickStart 🚀

We provide a demo case, which you can run with:

sh scripts/demo.sh

A visualized example:

EgoSchema Experiments 🔬

We obtained the dataset annotations and the extracted captions from the files provided by LLoVi. The subset annotations and captions are already placed in data/egoschema/.

If you don't want to pay for the OpenAI API, we provide our LLM conversation cache here. You can specify the cache path in arg_parser.py.
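
Conceptually, such a cache is just a record of previous LLM calls keyed by their prompts, so repeated requests can be served without calling the API. Below is an illustrative sketch of how a cache like this might be read; the file path, key scheme, and structure are assumptions and may not match the repository's actual cache format:

    import hashlib
    import json

    def load_cache(path="cache/llm_cache.json"):      # hypothetical path; set yours in arg_parser.py
        with open(path) as f:
            return json.load(f)                       # dict: prompt hash -> cached completion

    def cached_completion(cache, prompt):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return cache.get(key)                         # None means a real API call would be needed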

For the EgoSchema subset (500 videos), run:

sh scripts/egoschema_subset.sh

For the EgoSchema full set, download annotations and captions from LLoVi, specify the data path in arg_parser.py, and run:

sh scripts/egoschema_fullset.sh

The code will run an automated evaluation script and output the accuracy and the mean number of frames used, like this:

For a step-by-step analysis, run:

python3 analyze_results.py --filepath YOUR_RESULT_JSON_FILE_PATH

It will output a histogram showing the number of problems solved and the accuracy at each step like this:
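
For reference, a per-step breakdown like this one can be computed roughly as follows; the field names "step" and "correct" are hypothetical and may not match the keys in the repository's result JSON:

    import json
    from collections import defaultdict

    def per_step_accuracy(filepath):
        with open(filepath) as f:
            results = json.load(f)
        solved, total = defaultdict(int), defaultdict(int)
        items = results.values() if isinstance(results, dict) else results
        for item in items:
            step = item["step"]                    # number of search steps used for this question
            total[step] += 1
            solved[step] += int(item["correct"])   # whether the final answer was correct
        for step in sorted(total):
            print(f"step {step}: {solved[step]}/{total[step]} ({solved[step] / total[step]:.1%})")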

Acknowledgments

We thank the developers of LLoVi, VideoTree, VideoAgent, and HCQA for publicly releasing their code.

Citation

TODO
