@fabnemEPFL
Collaborator

PDF Image Analysis Enhancement

Overview

This PR adds image analysis to the PDF processor, enabling text extraction from images embedded in PDF files. The feature is particularly valuable for scanned documents, diagrams, charts, and other visual content.

Features

  • Added two image analysis options:
    • SmolDocling (Open Source): Uses the SmolDocling model from IBM Research and Hugging Face for local image analysis without API requirements
    • MistralOCR (API-based): Leverages Mistral's OCR API for high-quality text extraction from images
  • Updated PDF processor to detect and analyze embedded images
  • Added configuration options to enable/disable image analysis and select the analyzer type
  • Implemented batch processing for efficient handling of multiple images

Implementation Details

  • Created SmolDoclingImageAnalyzer class for local image analysis using transformers
  • Created MistralOCRImageAnalyzer class for API-based image analysis
  • Added configuration parameters in PDFProcessor to control image analysis behavior
  • Updated documentation with configuration examples and usage instructions
  • Added comprehensive unit tests for both analyzer implementations
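
As a rough illustration of how the batch processing listed under Features could sit on top of the two analyzer classes, here is a minimal sketch. The `ImageAnalyzer` protocol, the `analyze_batch` method, the `batch_size` parameter, and the `EchoAnalyzer` stand-in are all hypothetical names chosen for this example; the actual method names and signatures in this PR may differ.

```python
from typing import Iterator, List, Protocol, Sequence


class ImageAnalyzer(Protocol):
    """Hypothetical common interface; the PR's real method names may differ."""

    def analyze_batch(self, images: Sequence[bytes]) -> List[str]: ...


def batched(items: Sequence[bytes], batch_size: int) -> Iterator[Sequence[bytes]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def analyze_images(
    analyzer: ImageAnalyzer, images: Sequence[bytes], batch_size: int = 4
) -> List[str]:
    """Run the analyzer batch by batch and return one text per image, in order."""
    texts: List[str] = []
    for batch in batched(images, batch_size):
        texts.extend(analyzer.analyze_batch(batch))
    return texts


class EchoAnalyzer:
    """Stand-in analyzer, used only to make this sketch runnable."""

    def __init__(self) -> None:
        self.calls = 0

    def analyze_batch(self, images: Sequence[bytes]) -> List[str]:
        self.calls += 1
        return [f"{len(image)} bytes" for image in images]
```

With an interface like this, the PDF processor would only need to know about `analyze_batch`, so a local model and an API-backed analyzer could be swapped without touching the batching logic.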

Configuration

dispatcher_config:
  processor_config:
    PDFProcessor:
      - analyze_images: true  # Enable image analysis
      - image_analyzer_type: "smoldocling"  # Options: "smoldocling" or "mistral"
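
For readers wiring this up, the sketch below shows one way the two options above might be read and validated once the YAML has been loaded. The dictionary mirrors what a YAML loader would return for the snippet (a list of single-key mappings under `PDFProcessor`). The helper names `flatten_processor_options` and `select_analyzer_type`, and the choice to treat a missing `analyze_images` as disabled, are assumptions for illustration, not the processor's actual code.

```python
from typing import Any, Dict, List, Optional

# The two values documented in the configuration comment above.
ALLOWED_ANALYZERS = {"smoldocling", "mistral"}


def flatten_processor_options(entries: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge a list of single-key mappings into one flat options dict."""
    options: Dict[str, Any] = {}
    for entry in entries:
        options.update(entry)
    return options


def select_analyzer_type(options: Dict[str, Any]) -> Optional[str]:
    """Return the analyzer type, or None when image analysis is disabled.

    Assumption: a missing `analyze_images` key means disabled.
    """
    if not options.get("analyze_images", False):
        return None
    analyzer_type = options.get("image_analyzer_type")
    if analyzer_type not in ALLOWED_ANALYZERS:
        raise ValueError(
            f"image_analyzer_type must be one of {sorted(ALLOWED_ANALYZERS)}, "
            f"got {analyzer_type!r}"
        )
    return analyzer_type


# Same structure a YAML loader would produce for the snippet above.
config = {
    "dispatcher_config": {
        "processor_config": {
            "PDFProcessor": [
                {"analyze_images": True},
                {"image_analyzer_type": "smoldocling"},
            ]
        }
    }
}

pdf_entries = config["dispatcher_config"]["processor_config"]["PDFProcessor"]
options = flatten_processor_options(pdf_entries)
```

Validating the analyzer type up front means a typo in the config fails at startup rather than partway through a batch of PDFs.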

Testing

  • Added unit tests for both SmolDocling and MistralOCR analyzers
  • Tests include mocked implementations to avoid actual API calls or model loading during testing
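
The mocking approach can be sketched roughly as follows. `OCRAnalyzer`, its `analyze` method, and the `client.ocr(...)` call are hypothetical stand-ins invented for this example; they are not the Mistral client API or the classes added in this PR. The point is only that the external dependency is replaced with a `MagicMock`, so no network call or model load happens.

```python
from unittest.mock import MagicMock


class OCRAnalyzer:
    """Hypothetical API-backed analyzer; `client.ocr(image)` is an invented call."""

    def __init__(self, client) -> None:
        self.client = client

    def analyze(self, image: bytes) -> str:
        # Delegate to the (mockable) client and normalize whitespace.
        response = self.client.ocr(image)
        return response["text"].strip()


def test_analyze_uses_mocked_client() -> None:
    # The mock replaces the real API client, so nothing leaves the process.
    client = MagicMock()
    client.ocr.return_value = {"text": "  Figure 1: sample caption \n"}

    analyzer = OCRAnalyzer(client)

    assert analyzer.analyze(b"fake-png") == "Figure 1: sample caption"
    client.ocr.assert_called_once_with(b"fake-png")
```

The same pattern would apply to the local analyzer by mocking the model and processor objects instead of an API client.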
