
feat: add human-centric video understanding operators for HumanVBench #938

Open
SYSUzhouting wants to merge 1 commit into main from dev/humanvbench_ops

Conversation

@SYSUzhouting (Collaborator)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the video understanding capabilities of the system by introducing a suite of human-centric operators tailored for the HumanVBench project. It integrates advanced third-party models for detailed analysis of human presence, speech, and actions within videos, enabling more nuanced data processing for tasks focused on human interaction. The changes also include crucial fixes for batch processing and refinements in data handling, alongside updated documentation to guide users through the new features.

Highlights

  • New Human-Centric Video Understanding Operators: Introduced a comprehensive suite of new operators designed for human-centric video understanding, including tools for tracking human faces and bodies, detecting active speakers, analyzing facial demographics and emotions, and generating captions focused on human activity.
  • Integration of Third-Party Models: Integrated several external models such as YOLOv8 for human detection, Light-ASD for active speaker detection, and SenseVoice for audio processing, managed via Git submodules and custom patches to enhance video analysis capabilities.
  • Enhanced Video Filtering: Added a new VideoFaceRatioFilter to enable filtering of video samples based on the proportion of frames containing faces, ensuring a focus on human-centric content.
  • Refined Data Loading and Formatting: Streamlined JSON data loading by removing support for .gz and .zst compressed JSON files from the core data loading strategy and JSON formatter, simplifying file handling.
  • Batch Processing Fix for NLP Augmentation: Corrected a bug in the nlpaug_en_mapper to ensure that all samples within a batch are properly augmented, rather than only the first one, improving data augmentation reliability.
  • Documentation and Demo Updates: Provided new README files (English and Chinese) with quick start guides for the HumanVBench operators and added a new demo configuration (analyzer.yaml) and sample video dataset (demo-dataset-videos2.jsonl) to showcase the new functionalities.
  • Removal of LaTeX-Related Operators and Documentation: Removed LatexFigureContextExtractorMapper and LatexMergeTexMapper along with their associated documentation, indicating a shift in focus or deprecation of these features.


Changelog
  • README_this_pr.md
    • Added instructions for using HumanVBench operators.
  • README_this_pr_CH.md
    • Added Chinese instructions for using HumanVBench operators.
  • data_juicer/config/config_all.yaml
    • Removed LaTeX-related mappers.
    • Added new HumanVBench video mappers and a video face ratio filter.
  • data_juicer/core/data/load_strategy.py
    • Removed support for '.gz' and '.zst' compressed JSON files in the data loading strategy.
  • data_juicer/core/data/ray_dataset.py
    • Removed support for '.gz' and '.zst' compressed JSON files in Ray dataset reading.
  • data_juicer/format/json_formatter.py
    • Removed '.json.gz', '.jsonl.gz', and '.json.zst' from the list of supported JSON suffixes.
  • data_juicer/ops/filter/__init__.py
    • Imported and added 'VideoFaceRatioFilter' to the list of available filters.
  • data_juicer/ops/filter/image_face_count_filter.py
    • Updated the logic for assigning extra keyword arguments.
    • Added exception handling around face detection to prevent crashes.
  • data_juicer/ops/filter/image_face_ratio_filter.py
    • Updated the logic for assigning extra keyword arguments.
  • data_juicer/ops/filter/video_face_ratio_filter.py
    • Added a new filter to retain video samples based on the ratio of frames containing faces.
  • data_juicer/ops/filter/video_motion_score_filter.py
    • Updated the logic for assigning extra keyword arguments.
  • data_juicer/ops/mapper/__init__.py
    • Removed LaTeX-related mappers.
    • Imported new HumanVBench video understanding mappers.
  • data_juicer/ops/mapper/image_face_blur_mapper.py
    • Updated the logic for assigning extra keyword arguments.
  • data_juicer/ops/mapper/latex_figure_context_extractor_mapper.py
    • Removed this file.
  • data_juicer/ops/mapper/latex_merge_tex_mapper.py
    • Removed this file.
  • data_juicer/ops/mapper/nlpaug_en_mapper.py
    • Corrected the batch processing logic to ensure all samples in a batch are augmented.
  • data_juicer/ops/mapper/video_active_speaker_detect_mapper.py
    • Added a new mapper for detecting active speakers in videos, including consistency checks.
  • data_juicer/ops/mapper/video_audio_ASR_mapper.py
    • Added a new mapper for automatic speech recognition from video audio streams.
  • data_juicer/ops/mapper/video_audio_detect_age_gender_mapper.py
    • Added a new mapper to detect age and gender from speech in video audio.
  • data_juicer/ops/mapper/video_audio_speech_emotion_mapper.py
    • Added a new mapper for speech emotion recognition from video audio.
  • data_juicer/ops/mapper/video_captioning_face_attribute_emotion_mapper.py
    • Added a new mapper to generate captions describing facial attributes and emotions from human tracks in videos.
  • data_juicer/ops/mapper/video_captioning_from_human_tracks_mapper.py
    • Added a new mapper to generate captions and identify children based on human tracks in videos.
  • data_juicer/ops/mapper/video_face_blur_mapper.py
    • Updated the logic for assigning extra keyword arguments.
  • data_juicer/ops/mapper/video_human_tracks_extraction_mapper.py
    • Added a new mapper to extract and process face and human bounding box tracks from videos.
  • data_juicer/ops/mapper/video_human_tracks_face_demographic_mapper.py
    • Added a new mapper to extract facial demographics (age, gender, race) from human tracks in videos.
  • data_juicer/utils/ASD_mapper_utils.py
    • Added utility functions for video processing, including scene detection, face detection, face tracking, human bounding box finding, and video annotation.
  • data_juicer/utils/constant.py
    • Added new MetaKeys for HumanVBench-related attributes (e.g., active speaker, audio speech attributes, human track data).
    • Added a new StatsKey for 'video_face_exist'.
  • data_juicer/utils/file_utils.py
    • Removed specific handling for '.gz' compressed files.
  • data_juicer/utils/model_utils.py
    • Added functions to prepare and load models for SenseVoiceSmall, Light-ASD, YOLOv8_human, face_detect_S3FD, and wav2vec2_age_gender.
  • demos/data/demo-dataset-videos2.jsonl
    • Added a new demo dataset file containing video paths.
  • demos/video_humanvbench_simple/analyzer.yaml
    • Added a new demo configuration file for HumanVBench video processing pipeline.
  • docs/Cache.md
    • Removed this documentation file.
  • docs/Cache_ZH.md
    • Removed this Chinese documentation file.
  • docs/Export.md
    • Removed this documentation file.
  • docs/Export_ZH.md
    • Removed this Chinese documentation file.
  • docs/Operators.md
    • Updated the count of mapper operators.
    • Removed entries for LaTeX-related mappers.
  • docs/Tracing.md
    • Removed this documentation file.
  • docs/Tracing_ZH.md
    • Removed this Chinese documentation file.
  • docs/operators/mapper/latex_figure_context_extractor_mapper.md
    • Removed this documentation file.
  • docs/operators/mapper/latex_merge_tex_mapper.md
    • Removed this documentation file.
  • tests/format/test_json_formatter.py
    • Removed tests related to '.gz' and '.zst' compressed JSON files.
  • tests/ops/mapper/test_latex_figure_context_extractor_mapper.py
    • Removed this test file.
  • tests/ops/mapper/test_latex_merge_tex_mapper.py
    • Removed this test file.
  • tests/ops/mapper/test_nlpaug_en_mapper_batch_bug.py
    • Removed this test file, likely due to the batch processing bug being resolved.
  • tests/utils/test_file_utils.py
    • Removed tests related to '.gzip' file handling.
  • thirdparty/humanvbench_models/.gitmodules
    • Added Git submodules for YOLOv8_human, Light-ASD, and SenseVoice.
  • thirdparty/humanvbench_models/Light-ASD_changes.diff
    • Added a patch file for the Light-ASD repository.
  • thirdparty/humanvbench_models/README.md
    • Added a README file detailing the setup process for HumanVBench models, including submodule initialization and patch application.
  • thirdparty/humanvbench_models/SenseVoice_changes.diff
    • Added a patch file for the SenseVoice repository.
  • thirdparty/humanvbench_models/YOLOv8_human_changes.diff
    • Added a patch file for the YOLOv8_human repository.
  • thirdparty/humanvbench_models/audio_code/wav2vec_age_gender.py
    • Added a new Python utility for age and gender detection using a Wav2Vec2 model.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive set of new operators for human-centric video understanding, which is a significant feature addition. However, the review has identified several critical issues that need to be addressed. These include a regression in nlpaug_en_mapper that breaks batch processing, a critical bug in how several operators handle shared keyword arguments, and incorrect method signatures in the new mapper classes that will lead to runtime errors. Additionally, there are issues with temporary file handling, code duplication, and some potentially breaking changes regarding compressed file support that seem unrelated to the main feature. I recommend addressing these high-priority issues before merging.

for key in samples:
    if key != self.text_key:
        res_samples[key] += [samples[key][idx]] * len(sample_texts)
texts_to_aug = samples[self.text_key][0]  # batch_size = 1

critical

This change introduces a regression that breaks batch processing. The new implementation processes only the first sample of a batch (samples[self.text_key][0]), whereas the previous implementation correctly iterated over all samples. This will cause all but the first sample in a batch to be skipped during augmentation. The removal of tests/ops/mapper/test_nlpaug_en_mapper_batch_bug.py, which appears to have tested for this exact scenario, is also concerning.

Please revert to a batch processing logic that iterates over all samples in the batch, similar to the previous implementation.
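A corrected loop, sketched here as a standalone function rather than data-juicer's actual mapper (`augment_batch`, `augment`, and `text_key` are illustrative names, and the real operator's augmentation call stands behind `augment`), might look like:

```python
# Hedged sketch: iterate over every sample in the batch instead of
# only index 0, replicating the non-text fields once per generated
# augmentation variant so all columns stay aligned.
def augment_batch(samples, text_key, augment):
    res = {key: [] for key in samples}
    for idx, text in enumerate(samples[text_key]):
        aug_texts = augment(text)  # may return several variants per text
        res[text_key] += aug_texts
        for key in samples:
            if key != text_key:
                # keep the other columns in lockstep with the text column
                res[key] += [samples[key][idx]] * len(aug_texts)
    return res
```

With a batch of two samples, both are augmented, unlike the regressed implementation which would silently drop the second.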

Comment on lines +70 to +73
self.extra_kwargs = self._default_kwargs
for key in kwargs:
    if key in self.extra_kwargs:
        self.extra_kwargs[key] = kwargs[key]

critical

Assigning self.extra_kwargs = self._default_kwargs creates a reference to the mutable class attribute _default_kwargs, not a copy. This means that if multiple instances of this operator are created, they will all share and modify the same dictionary, leading to unexpected and hard-to-debug behavior. This should be self.extra_kwargs = self._default_kwargs.copy() to ensure each instance has its own copy.

This issue is also present in other files in this PR, including:

  • data_juicer/ops/filter/image_face_ratio_filter.py
  • data_juicer/ops/filter/video_motion_score_filter.py
  • data_juicer/ops/mapper/image_face_blur_mapper.py
  • data_juicer/ops/mapper/video_face_blur_mapper.py
Suggested change
Before:
    self.extra_kwargs = self._default_kwargs
    for key in kwargs:
        if key in self.extra_kwargs:
            self.extra_kwargs[key] = kwargs[key]
After:
    self.extra_kwargs = self._default_kwargs.copy()
    for key in kwargs:
        if key in self.extra_kwargs:
            self.extra_kwargs[key] = kwargs[key]
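The aliasing pitfall can be demonstrated with a minimal toy class; `SharedOp` and `CopiedOp` below are illustrative stand-ins, not the operators in this PR:

```python
# Without .copy(), every instance aliases the single class-level dict,
# so one instance's configuration leaks into all others.
class SharedOp:
    _default_kwargs = {"scale": 1}

    def __init__(self, **kwargs):
        self.extra_kwargs = self._default_kwargs  # aliases the class dict
        for key in kwargs:
            if key in self.extra_kwargs:
                self.extra_kwargs[key] = kwargs[key]


# With .copy(), each instance gets an independent dict of defaults.
class CopiedOp:
    _default_kwargs = {"scale": 1}

    def __init__(self, **kwargs):
        self.extra_kwargs = self._default_kwargs.copy()  # per-instance copy
        for key in kwargs:
            if key in self.extra_kwargs:
                self.extra_kwargs[key] = kwargs[key]


a = SharedOp(scale=5)
b = SharedOp()   # unexpectedly sees scale=5 via the shared class dict
c = CopiedOp(scale=5)
d = CopiedOp()   # correctly keeps the default scale=1
```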

# Use ray.data functions directly with PyArrow filesystem support
# Ray's read functions support filesystem parameter via PyArrow
Before:
    if data_format in {"json", "jsonl", "json.gz", "jsonl.gz", "json.zst", "jsonl.zst"}:
After:
    if data_format in {"json", "jsonl"}:

high

This PR seems to remove support for compressed json files (.gz, .zst) in several places, which is a significant breaking change and seems unrelated to the main goal of adding video operators. The changes are also inconsistent across the codebase. For example, load_strategy.py and ray_dataset.py remove support for .jsonl.zst, but data_juicer/format/json_formatter.py retains it. Could you clarify if removing compressed file support is intended? If so, the implementation should be consistent across the codebase.




def process_single(self, samples, rank=None):

high

This mapper is decorated with _batched_op = True, but it implements process_single instead of process_batched. The Mapper.process method will attempt to call process_batched(self, samples), which will fail due to the method name mismatch and different signature (process_single(self, samples, rank=None)). This will cause a TypeError at runtime.

Please rename the method to process_batched and adjust its signature to match the base class. This issue exists in several other new mappers in this PR as well (e.g., video_captioning_from_human_tracks_mapper, video_human_tracks_face_demographic_mapper).
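A minimal sketch of the expected shape, using a stand-in `Mapper` base class rather than data_juicer's actual one (the dispatch logic here is an assumption modeled on the comment above):

```python
# Stand-in base class: dispatches to process_batched when the
# operator is flagged as batched, mirroring the mismatch described
# in the review comment.
class Mapper:
    _batched_op = False

    def process(self, samples, *args, **kwargs):
        if self._batched_op:
            # a subclass that only defines process_single would raise
            # AttributeError/TypeError here at runtime
            return self.process_batched(samples, *args, **kwargs)
        return self.process_single(samples, *args, **kwargs)


class VideoTrackMapper(Mapper):
    _batched_op = True

    # named process_batched so the base-class dispatch finds it
    def process_batched(self, samples, rank=None):
        # the real mapper would run per-sample video analysis here
        return samples
```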

Comment on lines +153 to +159
temp_dir = tempfile.mkdtemp(dir=self.temp_save_path)
pyaviPath = os.path.join(temp_dir, 'pyavi')
pyframesPath = os.path.join(temp_dir, 'pyframes')
pyworkPath = os.path.join(temp_dir, 'pywork')
pycropPath = os.path.join(temp_dir, 'pycrop')
if os.path.exists(temp_dir):
    rmtree(temp_dir)

high

The logic for handling temporary directories appears to be flawed. A temporary directory is created with tempfile.mkdtemp, then immediately removed with rmtree if it exists (which it always will). Subsequent calls to os.makedirs for subdirectories within this now-deleted path will fail with a FileNotFoundError. The rmtree call should be placed in a finally block to ensure cleanup after the directory has been used.
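One way the suggested fix could look, sketched with a hypothetical `process_video` callback in place of the mapper's real work:

```python
import os
import tempfile
from shutil import rmtree


# Sketch of the suggested structure: create the temporary tree, use it,
# and remove it in a finally block so cleanup happens after use, not
# before. `process_video` is a placeholder for the real processing step.
def run_with_temp_dirs(base_dir, process_video):
    temp_dir = tempfile.mkdtemp(dir=base_dir)
    try:
        for name in ("pyavi", "pyframes", "pywork", "pycrop"):
            os.makedirs(os.path.join(temp_dir, name))
        return process_video(temp_dir)
    finally:
        # cleanup runs even if process_video raises
        rmtree(temp_dir, ignore_errors=True)
```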

Comment on lines +120 to +122
from .video_captioning_face_attribute_emotion_mapper import VideoCaptioningFaceAttributeEmotionMapper
from .video_captioning_from_human_tracks_mapper import VideoCaptioningFromHumanTracksMapper
from .video_captioning_face_attribute_emotion_mapper import VideoCaptioningFaceAttributeEmotionMapper

medium

VideoCaptioningFaceAttributeEmotionMapper is imported twice. This duplication should be removed for code cleanliness. This is also reflected in the __all__ list (lines 234 and 236) and in data_juicer/config/config_all.yaml.

Suggested change
Before:
    from .video_captioning_face_attribute_emotion_mapper import VideoCaptioningFaceAttributeEmotionMapper
    from .video_captioning_from_human_tracks_mapper import VideoCaptioningFromHumanTracksMapper
    from .video_captioning_face_attribute_emotion_mapper import VideoCaptioningFaceAttributeEmotionMapper
After:
    from .video_captioning_face_attribute_emotion_mapper import VideoCaptioningFaceAttributeEmotionMapper
    from .video_captioning_from_human_tracks_mapper import VideoCaptioningFromHumanTracksMapper


class VideoFaceRatioFilter(Filter):
    """
    Keep data samples whose videos' durations are within a specified range.

medium

The docstring for this filter seems to be a copy-paste from another operator. It states "Keep data samples whose videos' durations are within a specified range," but this filter operates on face ratios, not durations. Please update the docstring to accurately describe the filter's functionality.

Suggested change
Before:
    Keep data samples whose videos' durations are within a specified range.
After:
    Keep data samples whose videos' face-to-frame ratios are within a specified range.
