Modularize extractors, classifiers #134

Open
wants to merge 7 commits into base: main

Conversation

mschfh
Collaborator

@mschfh mschfh commented Dec 15, 2024

No description provided.

@mschfh mschfh self-assigned this Dec 15, 2024
@mschfh mschfh requested a review from antonok-edm as a code owner December 15, 2024 11:36
@mschfh mschfh marked this pull request as draft December 15, 2024 11:38

[puLL-Merge] - brave/cookiemonster@134

Description

This PR restructures the cookie consent detection system into modular extractors and classifiers, updates the test cases to check classifier results, and adds a development mode with auto-reloading. The changes aim to improve the accuracy and flexibility of cookie consent notice detection.

Changes

  1. Dockerfile:

    • Added a development mode option controlled by the DEV environment variable.
  2. docker-compose.yml:

    • Set DEV=true for development environment.
  3. package.json:

    • Added nodemon as a dev dependency for improved development experience.
    • Introduced a new dev script using nodemon for auto-reloading during development.
  4. src/classifiers/index.mjs (New file):

    • Implemented a new classifier system with support for multiple classifiers (LLM and keyword).
    • Added a weight-based verdict system for combining classifier results.
  5. src/classifiers/keyword.mjs (New file):

    • Implemented a simple keyword-based classifier.
  6. src/classifiers/llm.mjs (New file):

    • Implemented an LLM-based classifier using OpenAI's API.
  7. src/extractors/html.mjs (New file):

    • Implemented a new HTML-based extractor for identifying potential cookie consent notices.
  8. src/extractors/index.mjs (New file):

    • Added support for multiple extractors with a priority system.
  9. src/lib.mjs:

    • Refactored to use the new extractor and classifier systems.
    • Improved error handling and reporting.
  10. src/views/page.html.njk:

    • Updated the UI to display classifier results with color-coded tags.
  11. test/test.mjs:

    • Updated test cases to include expected classifier results.
    • Added support for concurrent testing with customizable concurrency.
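
To illustrate items 4 and 5, here is a minimal sketch of how a keyword classifier and a weight-based verdict could fit together. It is not the code from src/classifiers/index.mjs or src/classifiers/keyword.mjs: the function names, the keyword list, the weights, and the 0.5 threshold are all assumptions made for illustration.

```javascript
// Hypothetical sketch; names, keywords, weights, and threshold are assumptions.

const KEYWORDS = ["cookie", "consent", "gdpr", "privacy"];

// Keyword classifier: true if any keyword occurs in the text (case-insensitive).
function keywordClassifier(text) {
  const lower = text.toLowerCase();
  return KEYWORDS.some((kw) => lower.includes(kw));
}

// Combine boolean classifier results into a single verdict.
// `results` maps classifier name -> boolean, `weights` maps name -> number.
function weightedVerdict(results, weights, threshold = 0.5) {
  let total = 0;
  let positive = 0;
  for (const [name, verdict] of Object.entries(results)) {
    const w = weights[name] ?? 0; // unknown classifiers carry no weight
    total += w;
    if (verdict) positive += w;
  }
  if (total === 0) return false; // nothing with weight voted
  return positive / total >= threshold;
}

const weights = { llm: 2, keyword: 1 };
const results = {
  llm: true, // stand-in for an LLM classifier result
  keyword: keywordClassifier("We value your privacy. Accept all cookies?"),
};
console.log(weightedVerdict(results, weights)); // true
```

With these assumed weights, a positive LLM result alone carries the verdict (2/3), while a positive keyword match alone does not (1/3).
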
sequenceDiagram
    participant User
    participant API
    participant Extractors
    participant Classifiers
    participant LLM
    participant Browser

    User->>API: Request page check
    API->>Browser: Load page
    Browser->>API: Page loaded
    API->>Extractors: Extract potential elements
    Extractors->>API: Candidate elements
    loop For each candidate element
        API->>Classifiers: Classify element
        Classifiers->>LLM: LLM classification
        LLM->>Classifiers: LLM result
        Classifiers->>API: Classification results
    end
    API->>User: Return detection results
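
The flow in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `page` argument is a plain object standing in for a loaded browser page, the extractor and classifier names are hypothetical, the "first extractor that yields candidates wins" rule is one possible reading of the priority system, and the `llm` classifier is a local stub rather than a real API call.

```javascript
// Hypothetical sketch of the extract-then-classify loop; all names are assumptions.

// Extractors with a priority (lower number runs first); each returns candidate texts.
const extractors = [
  { name: "html", priority: 1, extract: async (page) => page.banners ?? [] },
  {
    name: "fallback",
    priority: 2,
    extract: async (page) => (page.bodyText ? [page.bodyText] : []),
  },
];

// Classifiers keyed by name; each resolves to a boolean.
const classifiers = {
  keyword: async (text) => /cookie|consent/i.test(text),
  llm: async (text) => /accept/i.test(text), // stub standing in for a remote model
};

// Try extractors in priority order and stop at the first one with candidates.
async function extractCandidates(page) {
  const ordered = [...extractors].sort((a, b) => a.priority - b.priority);
  for (const ex of ordered) {
    const candidates = await ex.extract(page);
    if (candidates.length > 0) return { extractor: ex.name, candidates };
  }
  return { extractor: null, candidates: [] };
}

// Run every classifier on every candidate element.
async function checkPage(page) {
  const { extractor, candidates } = await extractCandidates(page);
  const detections = [];
  for (const text of candidates) {
    const entries = await Promise.all(
      Object.entries(classifiers).map(async ([name, fn]) => [name, await fn(text)])
    );
    detections.push({ text, results: Object.fromEntries(entries) });
  }
  return { extractor, detections };
}

checkPage({ banners: ["We use cookies. Accept?"] }).then((out) => {
  console.log(JSON.stringify(out));
});
```
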

Possible Issues

  1. The new system might be more resource-intensive due to the use of multiple classifiers and extractors.
  2. The LLM-based classifier depends on an external API, which could introduce latency or reliability issues.
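
One common way to bound the latency risk in item 2 is to race the external call against a timer and fall back to "no verdict" when it loses or fails. The sketch below assumes this approach; the PR does not state that it does this, and `withTimeout` and `fakeLlmClassify` are hypothetical names (the latter is a local stub, not an OpenAI call).

```javascript
// Hypothetical helper: resolve with `fallback` if `promise` rejects
// or does not settle within `ms` milliseconds.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise.catch(() => fallback), timeout]).finally(() =>
    clearTimeout(timer)
  );
}

// Stub standing in for a remote LLM call that answers after `delayMs`.
function fakeLlmClassify(text, delayMs) {
  return new Promise((resolve) =>
    setTimeout(() => resolve(/cookie/i.test(text)), delayMs)
  );
}

withTimeout(fakeLlmClassify("We use cookies", 500), 100, null).then((verdict) => {
  console.log(verdict); // null: the stub was slower than the 100 ms budget
});
```

A `null` result can then be treated as an abstention, so that the remaining classifiers decide the verdict on their own.
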

Security Hotspots

  1. The extractFrameText function in src/lib.mjs uses eval, which could execute malicious code if its input is not properly sanitized.
