caca2331/talker-tracker

talker-tracker

A visual speaker-activity extractor for multi-avatar videos. This tool detects when each on-screen avatar is "speaking" based on changes in brightness or motion within defined regions of interest (ROIs). It generates an .ass (Advanced Substation Alpha) subtitle file where the active segments are marked for each speaker.

Prerequisites

  • Python 3.7+
  • opencv-python-headless
  • numpy

Installation

Install the required Python packages:

pip install opencv-python-headless numpy

Usage

1. Create a Configuration File

Create a JSON file (e.g., config.json) defining the speakers and their positions (ROIs) in the video.

Configuration Structure:

{
  "source_video": "path/to/video.mp4",
  "output_ass": "output.ass",
  "threshold_brightness": 50,
  "threshold_motion": 1000,
  "min_duration_frames": 5,
  "merge_gap_frames": 10,
  "speakers": [
    {
      "name": "Alice",
      "rect": [100, 100, 150, 150],
      "idle_time": 0.5
    },
    {
      "name": "Bob",
      "rect": [400, 100, 150, 150],
      "idle_time": 2.0
    }
  ]
}

Parameters:

  • source_video: (Optional) Path to the input video. Can be overridden by CLI argument.
  • output_ass: (Optional) Path for the output ASS file. Can be overridden by CLI argument.
  • threshold_brightness: Minimum increase in average brightness relative to the idle state to consider the ROI "active".
  • threshold_motion: Minimum sum of absolute differences between frames to consider the ROI "moving".
  • min_duration_frames: Minimum number of consecutive active frames required to create a segment.
  • merge_gap_frames: Maximum number of inactive frames between segments to merge them into a single continuous segment.
  • speakers: A list of speaker objects.
    • name: The name of the speaker (used for the subtitle style and event).
    • rect: The ROI [x, y, width, height] in pixels.
  • idle_time: A timestamp (in seconds) at which this speaker is known to be idle (dark/static). The frame at this time is used as a reference to calculate the baseline brightness for the ROI.

2. Run the Tool

Run the script with the path to your configuration file:

python main.py config.json

Optional Arguments:

  • --video: Override the video path specified in the config.
  • --output: Override the output path specified in the config.
Example:

python main.py config.json --video my_video.mp4 --output subtitles.ass

3. Generate Synthetic Test Video

You can generate a synthetic test video to verify the tool's functionality:

python generate_test_video.py

This will create test_video.mp4 and test_config.json. You can then run the main tool against these:

python main.py test_config.json --output test.ass

Output

The tool generates an .ass file containing:

  • Styles: A style defined for each speaker, positioned at their configured ROI.
  • Events: Subtitle events whose start and end times mark when the speaker was detected as active. The text content of the events is empty.
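A generated file looks roughly like the following (illustrative values only, not actual tool output; the style fields shown are the standard ASS ones):

```
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:01.00,0:00:03.00,Alice,,0,0,0,,
Dialogue: 0,0:00:02.00,0:00:04.00,Bob,,0,0,0,,
```

Because the Text field is empty, nothing is rendered on screen; the file serves as machine-readable timing data that can be loaded into a subtitle editor or post-processed into real captions.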
