MVP version of PDF deep processor (#73) #175

fabnemEPFL · 2025-10-02T08:51:09Z

PDF Image Analysis Enhancement

Overview

This PR adds advanced image analysis capabilities to the PDF processor, enabling text extraction from images embedded within PDF files. This feature is particularly valuable for scanned documents, diagrams, charts, and other visual content.

Features

Added two image analysis options:
- SmolDocling (Open Source): Uses the SmolDocling model from IBM Research and Hugging Face for local image analysis, with no external API required
- MistralOCR (API-based): Leverages Mistral's OCR API for high-quality text extraction from images
Updated PDF processor to detect and analyze embedded images
Added configuration options to enable/disable image analysis and select the analyzer type
Implemented batch processing for efficient handling of multiple images
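As a rough illustration of the batch processing mentioned above, the sketch below groups images into fixed-size batches and passes each batch to an analyzer callable. The names `iter_batches`, `analyze_in_batches`, and the default `batch_size` are assumptions made for this example and are not taken from the PR's code.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def analyze_in_batches(
    images: Sequence[T],
    analyze_batch: Callable[[List[T]], List[str]],
    batch_size: int = 4,  # hypothetical default, not from the PR
) -> List[str]:
    """Run `analyze_batch` once per batch and return the results in input order."""
    results: List[str] = []
    for batch in iter_batches(images, batch_size):
        results.extend(analyze_batch(batch))
    return results
```

Batching this way keeps the number of model or API invocations at roughly `len(images) / batch_size` instead of one per image, while preserving the order of the extracted text.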

Implementation Details

Created SmolDoclingImageAnalyzer class for local image analysis using transformers
Created MistralOCRImageAnalyzer class for API-based image analysis
Added configuration parameters in PDFProcessor to control image analysis behavior
Updated documentation with configuration examples and usage instructions
Added comprehensive unit tests for both analyzer implementations
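One way two interchangeable analyzers could sit behind a single interface is sketched below. This is not the PR's code: `ImageAnalyzer`, `LocalModelAnalyzer`, `ApiAnalyzer`, and `create_image_analyzer` are hypothetical stand-ins, and the stub bodies return placeholder strings instead of loading a model or calling an API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class ImageAnalyzer(ABC):
    """Hypothetical common interface: image bytes in, extracted text out."""

    @abstractmethod
    def analyze(self, images: List[bytes]) -> List[str]:
        ...


class LocalModelAnalyzer(ImageAnalyzer):
    """Stand-in for a locally run model; returns placeholder text only."""

    def analyze(self, images: List[bytes]) -> List[str]:
        return [f"[local text for image {i}]" for i in range(len(images))]


class ApiAnalyzer(ImageAnalyzer):
    """Stand-in for an API-backed analyzer; returns placeholder text only."""

    def analyze(self, images: List[bytes]) -> List[str]:
        return [f"[api text for image {i}]" for i in range(len(images))]


# Keys mirror the two values documented for image_analyzer_type.
ANALYZERS: Dict[str, Type[ImageAnalyzer]] = {
    "smoldocling": LocalModelAnalyzer,
    "mistral": ApiAnalyzer,
}


def create_image_analyzer(analyzer_type: str) -> ImageAnalyzer:
    """Return an analyzer instance for the given type name (case-insensitive)."""
    try:
        return ANALYZERS[analyzer_type.lower()]()
    except KeyError:
        options = ", ".join(sorted(ANALYZERS))
        raise ValueError(
            f"Unknown analyzer type {analyzer_type!r}; expected one of: {options}"
        ) from None
```

With a shape like this, the PDF processor only needs to hold an `ImageAnalyzer` and never has to know which backend was selected.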

Configuration

dispatcher_config:
  processor_config:
    PDFProcessor:
      - analyze_images: true  # Enable image analysis
      - image_analyzer_type: "smoldocling"  # Options: "smoldocling" or "mistral"
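
For illustration, the sketch below shows how a configuration of this shape might be read once it has been loaded into plain Python objects. The dict literal mirrors the YAML above. The `ImageAnalysisSettings` dataclass, the `parse_image_settings` helper, and the default values are assumptions for this example, not the project's actual loader.

```python
from dataclasses import dataclass
from typing import Any, Dict

ALLOWED_ANALYZERS = ("smoldocling", "mistral")


@dataclass(frozen=True)
class ImageAnalysisSettings:
    analyze_images: bool = False               # assumed default: feature off
    image_analyzer_type: str = "smoldocling"   # assumed default analyzer


def parse_image_settings(config: Dict[str, Any]) -> ImageAnalysisSettings:
    """Merge the list of single-key mappings under PDFProcessor and validate them."""
    entries = (
        config.get("dispatcher_config", {})
        .get("processor_config", {})
        .get("PDFProcessor", [])
    )
    merged: Dict[str, Any] = {}
    for entry in entries:
        merged.update(entry)

    analyzer_type = str(merged.get("image_analyzer_type", "smoldocling")).lower()
    if analyzer_type not in ALLOWED_ANALYZERS:
        raise ValueError(f"Unsupported image_analyzer_type: {analyzer_type!r}")
    return ImageAnalysisSettings(
        analyze_images=bool(merged.get("analyze_images", False)),
        image_analyzer_type=analyzer_type,
    )


# Same structure as the YAML above, written as a Python literal.
example_config = {
    "dispatcher_config": {
        "processor_config": {
            "PDFProcessor": [
                {"analyze_images": True},
                {"image_analyzer_type": "smoldocling"},
            ]
        }
    }
}
settings = parse_image_settings(example_config)
```

Validating the analyzer name at load time surfaces a typo in the config immediately instead of partway through a long processing run.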

Testing

Added unit tests for both SmolDocling and MistralOCR analyzers
Tests include mocked implementations to avoid actual API calls or model loading during testing
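
A minimal sketch of the mocking approach described above, using only the standard library's `unittest.mock`. The `ApiImageAnalyzer` class and its `client.ocr(...)` call are hypothetical placeholders written for this example. They do not reflect the PR's classes or the real Mistral SDK.

```python
import unittest
from unittest.mock import Mock


class ApiImageAnalyzer:
    """Hypothetical analyzer that delegates OCR to an injected client object."""

    def __init__(self, client):
        self._client = client

    def analyze(self, image: bytes) -> str:
        # `ocr` is an assumed method name on the injected client.
        response = self._client.ocr(image)
        return response["text"].strip()


class ApiImageAnalyzerTest(unittest.TestCase):
    def test_returns_stripped_text_without_network_access(self):
        # The Mock replaces the real client, so no API call is made.
        client = Mock()
        client.ocr.return_value = {"text": "  Figure 1: revenue by year \n"}

        analyzer = ApiImageAnalyzer(client)

        self.assertEqual(analyzer.analyze(b"fake-png"), "Figure 1: revenue by year")
        client.ocr.assert_called_once_with(b"fake-png")
```

Injecting the client through the constructor is what makes the substitution possible. The same pattern applies to a local model: pass in the loaded model object, and hand the test a `Mock` instead. The test case can be run with `python -m unittest`.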

Co-authored-by: Triomphe Achille <[email protected]> Co-authored-by: fabnemEPFL <[email protected]> Co-authored-by: Fabrice Nemo <[email protected]>

Triomphe Achille and others added 10 commits April 15, 2025 15:03

MVP version of PDF deep processor

c55fce2

Merge branch 'master' into pdf-deep-processor

779902a

reformatting with isort and ruff

182d0b4

Merge branch 'master' into pdf-deep-processor

ff57aba

fix for tests

6711b87

MVP version of PDF deep processor (#73)

5aecd98

Co-authored-by: Triomphe Achille <[email protected]> Co-authored-by: fabnemEPFL <[email protected]> Co-authored-by: Fabrice Nemo <[email protected]>

Merge remote-tracking branch 'origin/pdf-deep-processor' into pdf-dee…

2971421

…p-processor

fixes for useless imports

ebe2bb5

fix for linter

b626865

again

71f467a
