A visual speaker-activity extractor for multi-avatar videos. This tool detects when each on-screen avatar is "speaking" based on changes in brightness or motion within defined regions of interest (ROIs). It generates an .ass (Advanced Substation Alpha) subtitle file where the active segments are marked for each speaker.
- Python 3.7+
- opencv-python-headless
- numpy
Install the required Python packages:

```
pip install opencv-python-headless numpy
```

Create a JSON file (e.g., config.json) defining the speakers and their positions (ROIs) in the video.
Configuration Structure:
```json
{
  "source_video": "path/to/video.mp4",
  "output_ass": "output.ass",
  "threshold_brightness": 50,
  "threshold_motion": 1000,
  "min_duration_frames": 5,
  "merge_gap_frames": 10,
  "speakers": [
    {
      "name": "Alice",
      "rect": [100, 100, 150, 150],
      "idle_time": 0.5
    },
    {
      "name": "Bob",
      "rect": [400, 100, 150, 150],
      "idle_time": 2.0
    }
  ]
}
```

Parameters:
- `source_video`: (Optional) Path to the input video. Can be overridden by a CLI argument.
- `output_ass`: (Optional) Path for the output ASS file. Can be overridden by a CLI argument.
- `threshold_brightness`: Minimum increase in average brightness, relative to the idle state, for the ROI to be considered "active".
- `threshold_motion`: Minimum sum of absolute differences between frames for the ROI to be considered "moving".
- `min_duration_frames`: Minimum number of consecutive active frames required to create a segment.
- `merge_gap_frames`: Maximum number of inactive frames between segments for them to be merged into a single continuous segment.
- `speakers`: A list of speaker objects:
  - `name`: The name of the speaker (used for the subtitle style and event).
  - `rect`: The ROI as `[x, y, width, height]` in pixels.
  - `idle_time`: A time in seconds at which this speaker is known to be idle (dark/static). The frame at this time is used as the reference for the baseline brightness.
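To make the interplay of these parameters concrete, here is a minimal sketch of the detection pipeline. The function names, the exact activity rule (brightness and motion combined with AND here), and the order of the merge and minimum-duration passes are illustrative assumptions, not necessarily what the tool does internally:

```python
import numpy as np

def is_active(roi, prev_roi, baseline, t_bright=50, t_motion=1000):
    """Per-frame test: the ROI must be brighter than its idle baseline
    and show enough frame-to-frame motion (assumed AND; could be OR)."""
    brightness_delta = roi.mean() - baseline
    motion = np.abs(roi.astype(np.int16) - prev_roi.astype(np.int16)).sum()
    return bool(brightness_delta >= t_bright and motion >= t_motion)

def frames_to_segments(active, min_duration=5, merge_gap=10):
    """Turn a per-frame boolean sequence into (start, end) frame segments:
    merge segments separated by <= merge_gap inactive frames, then drop
    segments shorter than min_duration frames."""
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(active) - 1))
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] - 1 <= merge_gap:
            merged[-1] = (merged[-1][0], seg[1])  # close the short gap
        else:
            merged.append(seg)
    return [(s, e) for s, e in merged if e - s + 1 >= min_duration]

# Two early bursts separated by a 3-frame gap merge into one segment;
# the trailing 2-frame blip is shorter than min_duration and is dropped.
active = [True]*6 + [False]*3 + [True]*6 + [False]*20 + [True]*2
print(frames_to_segments(active))  # → [(0, 14)]
```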
Run the script with the path to your configuration file:

```
python main.py config.json
```

Optional arguments:

- `--video`: Override the video path specified in the config.
- `--output`: Override the output path specified in the config.

Example:

```
python main.py config.json --video my_video.mp4 --output subtitles.ass
```

You can generate a synthetic test video to verify the tool's functionality:

```
python generate_test_video.py
```

This will create test_video.mp4 and test_config.json. You can then run the main tool against these:

```
python main.py test_config.json --output test.ass
```

The tool generates an .ass file containing:
- Styles: one style per speaker, positioned at that speaker's configured ROI.
- Events: subtitle events whose start and end times correspond to the intervals in which the speaker was detected as active. The text content of the events is empty.
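As a sketch of how the event timing translates into the output, here is a small helper that formats a detected segment as an ASS Dialogue line with an empty text field. ASS timestamps have centisecond precision (`h:mm:ss.cc`); the function names are illustrative, not the tool's actual API:

```python
def ass_time(seconds):
    """Format a time in seconds as ASS h:mm:ss.cc (centisecond precision)."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"

def dialogue_line(style, start_s, end_s):
    """Emit an ASS Dialogue event referencing the speaker's style,
    with the Text field left empty, as described above."""
    return (f"Dialogue: 0,{ass_time(start_s)},{ass_time(end_s)},"
            f"{style},,0,0,0,,")

print(dialogue_line("Alice", 1.2, 3.84))
# Dialogue: 0,0:00:01.20,0:00:03.84,Alice,,0,0,0,,
```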