This is an enhanced, lightweight document intelligence system designed for Round 1B of the Persona-Driven Document Intelligence challenge. The system extracts and prioritizes the most relevant sections from document collections based on specific personas and their job-to-be-done requirements.
- Lightweight Design: Total model size ~150-250MB (well under 1GB limit)
- Fast Processing: Optimized for 60-second processing time
- Persona-Aware: Specialized processing for different user types
- High Quality: Sophisticated algorithms despite lightweight approach
- Robust Error Handling: Comprehensive fallback mechanisms
- CPU-Only: No GPU dependencies
Core components:

- Enhanced Text Extraction (`src/extract_text.py`)
  - Robust PDF parsing with intelligent section detection
  - Font-based heading identification (see the sketch after this list)
  - Hierarchical structure recognition
- Quality-Optimized Embedding (`src/embed_text.py`)
  - Uses all-MiniLM-L6-v2 (~90 MB), upgraded for better quality
  - Efficient caching system
  - Text chunking for better representation
- Enhanced Ranking (`src/rank_sections.py`)
  - Multi-criteria scoring (semantic, persona, actionability, quality)
  - Persona-specific keyword weighting
  - Diversity optimization
- Lightweight Refinement (`src/refine_subsections.py`)
  - NLTK-based text processing
  - Persona-specific extraction strategies
  - Rule-based summarization
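As a rough illustration of the font-based heading identification mentioned above, the sketch below flags text spans whose font size clearly exceeds a page's typical body size, using PyMuPDF (`fitz`, which the project already relies on). The function name, the median baseline, and the `min_size_ratio` threshold are illustrative assumptions, not the actual logic in `src/extract_text.py`:

```python
import fitz  # PyMuPDF

def candidate_headings(pdf_path, min_size_ratio=1.2):
    """Flag spans whose font size clearly exceeds the page's body-text size."""
    doc = fitz.open(pdf_path)
    headings = []
    for page_num, page in enumerate(doc, start=1):
        spans = [
            span
            for block in page.get_text("dict")["blocks"]
            for line in block.get("lines", [])  # image blocks have no "lines"
            for span in line["spans"]
            if span["text"].strip()
        ]
        if not spans:
            continue
        # Use the median span size as a proxy for the body-text size.
        body_size = sorted(s["size"] for s in spans)[len(spans) // 2]
        for span in spans:
            if span["size"] >= body_size * min_size_ratio:
                headings.append((page_num, span["text"].strip(), span["size"]))
    return headings
```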
Prerequisites:

- Python 3.10 or higher
- Docker (optional, for containerized execution)
To install:

- Clone the repository

  ```bash
  git clone <repository-url>
  cd adolfee_1b
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Verify installation

  ```bash
  python model_size_checker.py
  ```
To build and verify with Docker:

- Build the Docker image

  ```bash
  docker build -t document-intelligence .
  ```

- Verify the container

  ```bash
  docker run --rm document-intelligence python model_size_checker.py
  ```
The system supports processing different document collections:
```bash
# Process Travel Planning collection (default)
python run.py --collection Collection_1_Travel_Planning

# Process Adobe Acrobat collection
python run.py --collection Collection_2_Adobe_Acrobat

# Clean up temporary files after processing
python run.py --collection Collection_1_Travel_Planning --cleanup

# Process with Docker
docker run -v $(pwd):/app document-intelligence python run.py --collection Collection_1_Travel_Planning

# Interactive mode
docker run -it -v $(pwd):/app document-intelligence bash
```

The system expects input files in the following structure:
```
Collection_X/
├── challenge1b_input.json
└── PDFs/
    ├── document1.pdf
    ├── document2.pdf
    └── ...
```

Example `challenge1b_input.json`:
```json
{
  "documents": [
    {"filename": "document1.pdf", "title": "Document 1"},
    {"filename": "document2.pdf", "title": "Document 2"}
  ],
  "persona": {"role": "Travel Planner"},
  "job_to_be_done": {"task": "Plan a trip for 4 days"}
}
```

The system generates `challenge1b_output.json` with the following structure:
```json
{
  "metadata": {
    "input_documents": ["document1.pdf", "document2.pdf"],
    "persona": "Travel Planner",
    "job_to_be_done": "Plan a trip for 4 days",
    "processing_timestamp": "2025-01-XX..."
  },
  "extracted_sections": [
    {
      "document": "document1.pdf",
      "section_title": "Section Title",
      "page_number": 1,
      "importance_rank": 1
    }
  ],
  "subsection_analysis": [
    {
      "document": "document1.pdf",
      "page_number": 1,
      "refined_text": "Extracted and refined content..."
    }
  ]
}
```

To check model sizes, run `python model_size_checker.py`. This utility checks the following (a sketch appears after the list):
- Current model cache sizes
- Estimated model sizes
- Compliance with 1GB limit
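A cache-size check of this kind can be implemented by walking the cache directories and summing file sizes. The sketch below is a minimal version, not the actual contents of `model_size_checker.py`:

```python
import os

LIMIT_MB = 1024  # the challenge's 1 GB model-size limit

def dir_size_mb(path):
    """Sum the sizes of all files under a directory tree, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total / (1024 * 1024)

for cache in ("~/.cache/huggingface", "~/.cache/torch"):
    size = dir_size_mb(os.path.expanduser(cache))
    print(f"{cache}: {size:.1f} MB (within {LIMIT_MB} MB limit: {size < LIMIT_MB})")
```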
The system automatically tracks the following (see the monitoring sketch after this list):
- Processing time
- Memory usage
- CPU usage
- Model loading time
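Tracking of this kind is straightforward with `psutil`, which is already a project dependency (see the troubleshooting section). A minimal sketch, assuming the pipeline runs in-process:

```python
import time
import psutil

proc = psutil.Process()  # the current Python process
start = time.perf_counter()

# ... run the document-intelligence pipeline here ...

elapsed = time.perf_counter() - start
rss_mb = proc.memory_info().rss / (1024 * 1024)  # resident memory
cpu_pct = proc.cpu_percent(interval=0.1)         # sampled CPU usage
print(f"time: {elapsed:.1f}s  memory: {rss_mb:.0f} MB  cpu: {cpu_pct:.0f}%")
```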
The system is optimized for the following personas; a scoring sketch follows the list:

- Travel Planner
  - Keywords: hotel, restaurant, attraction, transport, booking
  - Focus: Actionable travel information
- Researcher
  - Keywords: methodology, results, conclusion, data, analysis
  - Focus: Key findings and research insights
- Student
  - Keywords: concept, definition, example, important, key point
  - Focus: Learning points and educational content
- Business Analyst
  - Keywords: revenue, profit, market, strategy, performance
  - Focus: Business insights and metrics
- HR Professional
  - Keywords: form, process, procedure, policy, compliance
  - Focus: Procedural information and workflows
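To make the persona keyword weighting concrete, the sketch below blends all-MiniLM-L6-v2 semantic similarity with a persona keyword bonus. The 70/30 blend, the helper names, and the single persona entry are illustrative assumptions; only the model and the keyword lists come from this document:

```python
from sentence_transformers import SentenceTransformer, util

# Keyword lists per persona (one shown; the rest follow the list above).
PERSONA_KEYWORDS = {
    "Travel Planner": ["hotel", "restaurant", "attraction", "transport", "booking"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_section(section_text, query, persona, kw_weight=0.3):
    """Blend semantic similarity with a persona-specific keyword bonus."""
    semantic = util.cos_sim(
        model.encode(query, convert_to_tensor=True),
        model.encode(section_text, convert_to_tensor=True),
    ).item()
    keywords = PERSONA_KEYWORDS.get(persona, [])
    text = section_text.lower()
    kw_score = sum(kw in text for kw in keywords) / max(len(keywords), 1)
    return (1 - kw_weight) * semantic + kw_weight * kw_score
```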
Environment variables:

- `PYTHONPATH`: Application path
- `TRANSFORMERS_CACHE`: Model cache directory
- `HF_HOME`: Hugging Face cache directory
- `NLTK_DATA`: NLTK data directory
- `SENTENCE_TRANSFORMERS_HOME`: Sentence transformers cache
Configuration limits (see the sketch below):

- `max_processing_time`: 60 seconds
- `max_model_size_mb`: 1024 MB (1 GB)
- `top_k_sections`: 5 (number of top sections to extract)
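Expressed as a plain dictionary (the key names mirror the settings above; how the project actually stores them is an assumption):

```python
# Defaults mirroring the constraints listed above; names are illustrative.
CONFIG = {
    "max_processing_time": 60,   # seconds
    "max_model_size_mb": 1024,   # 1 GB
    "top_k_sections": 5,         # number of top sections to extract
}
```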
Common issues:

- Model Download Issues

  ```bash
  # Clear model cache
  rm -rf ~/.cache/huggingface ~/.cache/torch
  ```

- Memory Issues

  ```bash
  # Check available memory
  python -c "import psutil; print(psutil.virtual_memory())"
  ```

- PDF Processing Issues

  ```bash
  # Verify PDF files
  python -c "import fitz; doc = fitz.open('document.pdf'); print(len(doc))"
  ```
The system generates detailed logs in `document_intelligence.log` (a setup sketch follows this list):
- Processing steps
- Error messages
- Performance metrics
- Model loading information
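A logging setup along these lines would produce such a file; this is a minimal sketch, and the project's actual format may differ:

```python
import logging

logging.basicConfig(
    filename="document_intelligence.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).info("pipeline started")
```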
The system includes comprehensive validation (an input-check sketch follows this list):
- Input data structure validation
- Output format validation
- Performance constraint checking
- Model size verification
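An input check against the `challenge1b_input.json` structure shown earlier could look like the sketch below; the function and error messages are illustrative, not the project's actual validators:

```python
import json

REQUIRED_KEYS = {"documents", "persona", "job_to_be_done"}

def validate_input(path):
    """Check that challenge1b_input.json has the expected top-level structure."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for doc in data["documents"]:
        if "filename" not in doc:
            raise ValueError("each document entry needs a 'filename'")
    return data
```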
```bash
# Run model size check
python model_size_checker.py

# Test with sample data
python run.py --collection Collection_1_Travel_Planning
```

Approximate model sizes:

- all-MiniLM-L6-v2: ~90 MB (upgraded for better quality)
- PyTorch Runtime: ~50-100 MB
- Other Dependencies: ~50-100 MB
- Total: ~180-280 MB
Typical processing times:

- Small collections (3-5 documents): 15-30 seconds
- Medium collections (6-8 documents): 30-45 seconds
- Large collections (9-10 documents): 45-60 seconds
Memory usage:

- Peak memory: ~500-800 MB
- Average memory: ~300-500 MB
- Cache size: ~100-200 MB
To set up for development:

- Install development dependencies

  ```bash
  pip install -r requirements.txt
  pip install pytest black flake8
  ```

- Run tests

  ```bash
  pytest tests/
  ```

- Code formatting

  ```bash
  black src/
  flake8 src/
  ```
This project is developed for the Round 1B Document Intelligence Challenge.
For issues and questions:
- Check the troubleshooting section
- Review the logs in `document_intelligence.log`
- Run the model size checker for verification
- Ensure all dependencies are properly installed