Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize honeypot selection algorithm #8857

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

zhiltsov-max
Copy link
Contributor

@zhiltsov-max zhiltsov-max commented Dec 21, 2024

Motivation and context

Improved performance of the honeypot selection/update algorithm, noticeable on tasks with ~50k frames.

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)
  • I have increased versions of npm packages if it is necessary
    (cvat-canvas,
    cvat-core,
    cvat-data and
    cvat-ui)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

Summary by CodeRabbit

  • New Features

    • Introduced a new attribute for improved frame selection in job validation.
    • Added a new class for managing item counts, enhancing frame selection logic.
  • Bug Fixes

    • Adjusted frame selection logic for better efficiency and accuracy.
  • Documentation

    • Updated comments and documentation to reflect new logic and attributes.

Copy link
Contributor

coderabbitai bot commented Dec 21, 2024

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes focus on enhancing the frame selection mechanism for honeypot validation in the CVAT (Computer Vision Annotation Tool) project. A new _BaggedCounter class is introduced in the task_validation.py file to manage item counts more efficiently. The serializers in serializers.py are updated to incorporate a new honeypot_frame_selector that uses this counter to select frames for validation in a more structured and reproducible manner.

Changes

File Changes
cvat/apps/engine/serializers.py - Added honeypot_frame_selector attribute to _TaskValidationLayoutBulkUpdateContext
- Modified update methods in JobValidationLayoutWriteSerializer and TaskValidationLayoutWriteSerializer to use new frame selection logic
cvat/apps/engine/task_validation.py - Added new _BaggedCounter class with methods for creating and managing counted items
- Updated HoneypotFrameSelector to use _BaggedCounter for frame selection
- Improved frame selection reproducibility

Sequence Diagram

sequenceDiagram
    participant Serializer as Task/Job Serializer
    participant Selector as HoneypotFrameSelector
    participant Counter as _BaggedCounter

    Serializer->>Selector: Create with validation frame counts
    Selector->>Counter: Initialize from frame counts
    Serializer->>Selector: select_next_frames(count)
    Selector->>Counter: Use shuffle and select methods
    Counter-->>Selector: Return selected frames
    Selector-->>Serializer: Provide selected frames
Loading

Poem

🐰 Hop, hop, through frames we go,
With counters smart and logic's flow
Honeypot selection, now precise
Random yet balanced, oh so nice!
CodeRabbit's magic makes it right 🎲


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
cvat/apps/engine/serializers.py (1)

Line range hint 1104-1115: Consider adding input validation for negative counts passed to select_next_frames

Within JobValidationLayoutWriteSerializer, the code calls frame_selector.select_next_frames(segment_honeypots_count). If segment_honeypots_count is ever 0 or negative (due to incomplete cleanup or counts mismatch), an exception or unexpected behavior might result. Adding a quick check to ensure a nonnegative count might help.

+ if segment_honeypots_count <= 0:
+     raise serializers.ValidationError("No available honeypots to select.")
🧹 Nitpick comments (7)
cvat/apps/engine/task_validation.py (5)

16-24: Consider renaming _BaggedCounter to BaggedCounter for clarity

The underscore prefix conventionally denotes a private or internal usage. However, this class is being used as a key part of the honeypot selection feature, making it relatively pivotal to the design. If it is intended to be widely used, renaming it to BaggedCounter (without the underscore) may make the intent clearer.


45-53: Ensure deterministic seeding behavior for shuffle

If the shuffle is intended to be reproducible given a seed, ensure the random generator is consistently seeded outside of the function if it needs to be stable. If multiple calls to shuffle() share the same rng, verify that it will produce the intended random distribution across calls.


82-84: Check concurrency when updating item usage

If multiple threads or processes use HoneypotFrameSelector, the shared data structures in _BaggedCounter could become corrupted during the pop() and push operations in use_item(). Ensuring thread/process safety (e.g., using locks, or restricting concurrency) might be necessary if usage can happen concurrently.


88-102: Improve error handling or logging for missing frames in HoneypotFrameSelector

Currently, the constructor ensures that frames are constructed from a valid dictionary. However, if the dictionary is missing certain keys or if frames are outside the expected domain, there is no explicit exception or log message before continuing. Consider adding raised exceptions or logs to make debugging easier if the dictionary is incomplete.


103-115: Document the assumptions in select_next_frames

The docstring mentions that certain properties (e.g., no repetition, uniform usage) are guaranteed. It would help to clarify that only frames known to the _counter can be selected and how the function responds when the requested count exceeds the number of available frames.

cvat/apps/engine/serializers.py (2)

1489-1491: Check if frames are disabled or removed mid-process

When re-selecting frames in the “RANDOM_UNIFORM” method, the code calls HoneypotFrameSelector. If some frames become disabled (i.e., removed from active_validation_frames) after random selection, the internal logic might be incomplete or lead to an inconsistent final distribution. Validate if you need to re-check frame status or re-run the selection in that scenario.

Would you like me to open a new GitHub issue to propose an approach for re-validating frames in real time?


Line range hint 1104-1115: Suggest unit tests covering corner cases

No direct unit test references appear here. We recommend adding tests around these lines to cover:
• Zero or minimal values for honeypot frames.
• Attempted usage with an empty active_validation_frames.
• Large or random values where random_seed is used.

Also applies to: 1489-1491

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12f886c and fc88a68.

📒 Files selected for processing (2)
  • cvat/apps/engine/serializers.py (5 hunks)
  • cvat/apps/engine/task_validation.py (1 hunks)
🔇 Additional comments (3)
cvat/apps/engine/task_validation.py (2)

25-41: Validate negative or unexpected count inputs in from_iterable and from_counts methods

While the new code effectively constructs “bags” grouped by usage count, it does not guard against invalid usage counts (e.g., negative). This could happen if the input data source has errors or if the code inadvertently computes a negative count. Adding validation or safely clamping negative counts to zero would prevent silent anomalies.

Would you like a script to scan the codebase for possible negative usage count assignments or calls?


Line range hint 1489-1491: Post-initialization check for random_seed usage in frame_selector

When assigning “frame_selector = HoneypotFrameSelector(active_validation_frame_counts)” (line 1490–1491), confirm that any random seed defined at the Task level or within the global context is properly passed to HoneypotFrameSelector. If the seed is omitted, the resulting random selection may yield nondeterministic behavior, which can complicate reproducibility.

cvat/apps/engine/serializers.py (1)

Line range hint 1362-1374: Verify that the bulk context usage covers all frames thoroughly

The _TaskValidationLayoutBulkUpdateContext is a convenient way to manage bulk updates of frames. However, if frames are partially updated in other layers or if external code modifies frames after initialization, verify that these changes do not bypass the context. Consider adding a check or log if frames are found in an inconsistent state at the end of bulk operations.


def shuffle(self, *, rng: np.random.Generator | None):
if not rng:
rng = np.random.default_rng()

Check notice

Code scanning / SonarCloud

Results that depend on random number generation should be reproducible Low

Provide a seed for this random generator. See more on SonarQube Cloud
@codecov-commenter
Copy link

codecov-commenter commented Dec 22, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 10 lines in your changes missing coverage. Please review.

Project coverage is 73.94%. Comparing base (9a25291) to head (7b9fb5d).
Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8857      +/-   ##
===========================================
+ Coverage    73.89%   73.94%   +0.05%     
===========================================
  Files          408      408              
  Lines        44131    44214      +83     
  Branches      3986     3993       +7     
===========================================
+ Hits         32611    32696      +85     
+ Misses       11520    11518       -2     
Components Coverage Δ
cvat-ui 78.38% <76.78%> (-0.02%) ⬇️
cvat-server 70.17% <85.71%> (+0.10%) ⬆️

Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants