SuseelKumarG/adolfee_1b


Enhanced Document Intelligence System

Overview

This is an enhanced, lightweight document intelligence system designed for Round 1B of the Persona-Driven Document Intelligence challenge. The system extracts and prioritizes the most relevant sections from document collections based on specific personas and their job-to-be-done requirements.

Key Features

  • Lightweight Design: ~190-290MB total footprint (well under the 1GB limit)
  • Fast Processing: optimized to complete within the 60-second processing budget
  • Persona-Aware: Specialized processing for different user types
  • High Quality: Sophisticated algorithms despite lightweight approach
  • Robust Error Handling: Comprehensive fallback mechanisms
  • CPU-Only: No GPU dependencies

System Architecture

Core Components

  1. Enhanced Text Extraction (src/extract_text.py)

    • Robust PDF parsing with intelligent section detection
    • Font-based heading identification
    • Hierarchical structure recognition
  2. Quality-Optimized Embedding (src/embed_text.py)

    • Uses all-MiniLM-L6-v2 (~90MB) for strong quality at a small footprint
    • Efficient caching system
    • Text chunking for better representation
  3. Enhanced Ranking (src/rank_sections.py)

    • Multi-criteria scoring (semantic, persona, actionability, quality)
    • Persona-specific keyword weighting
    • Diversity optimization
  4. Lightweight Refinement (src/refine_subsections.py)

    • NLTK-based text processing
    • Persona-specific extraction strategies
    • Rule-based summarization
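The ranking component's multi-criteria scoring can be sketched as a weighted combination of per-criterion scores. The weights, helper names, and sample data below are illustrative assumptions, not the actual src/rank_sections.py API:

```python
# Illustrative sketch of multi-criteria section scoring; the weights and
# function signature are assumptions, not the real rank_sections.py API.
def score_section(semantic, persona, actionability, quality,
                  weights=(0.5, 0.2, 0.15, 0.15)):
    """Combine per-criterion scores (each in [0, 1]) into one rank score."""
    parts = (semantic, persona, actionability, quality)
    return sum(w * s for w, s in zip(weights, parts))

# Hypothetical sections with pre-computed criterion scores.
sections = [
    {"title": "Hotels in Nice", "scores": (0.9, 0.8, 0.7, 0.6)},
    {"title": "History of Provence", "scores": (0.6, 0.2, 0.1, 0.8)},
]
ranked = sorted(sections, key=lambda s: score_section(*s["scores"]),
                reverse=True)
```

In this sketch, the semantic similarity score dominates, with persona fit, actionability, and text quality acting as tie-breaking signals.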

Installation

Prerequisites

  • Python 3.10 or higher
  • Docker (optional, for containerized execution)

Local Installation

  1. Clone the repository

    git clone <repository-url>
    cd adolfee_1b
  2. Install dependencies

    pip install -r requirements.txt
  3. Verify installation

    python model_size_checker.py

Docker Installation

  1. Build the Docker image

    docker build -t document-intelligence .
  2. Verify the container

    docker run --rm document-intelligence python model_size_checker.py

Usage

Command Line Interface

The system supports processing different document collections:

# Process Travel Planning collection (default)
python run.py --collection Collection_1_Travel_Planning

# Process Adobe Acrobat collection
python run.py --collection Collection_2_Adobe_Acrobat

# Clean up temporary files after processing
python run.py --collection Collection_1_Travel_Planning --cleanup

Docker Usage

# Process with Docker
docker run -v $(pwd):/app document-intelligence python run.py --collection Collection_1_Travel_Planning

# Interactive mode
docker run -it -v $(pwd):/app document-intelligence bash

Input Format

The system expects input files in the following structure:

Collection_X/
├── challenge1b_input.json
└── PDFs/
    ├── document1.pdf
    ├── document2.pdf
    └── ...

Example challenge1b_input.json:

{
    "documents": [
        {"filename": "document1.pdf", "title": "Document 1"},
        {"filename": "document2.pdf", "title": "Document 2"}
    ],
    "persona": {"role": "Travel Planner"},
    "job_to_be_done": {"task": "Plan a trip for 4 days"}
}
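Loading and validating this input file can be sketched as follows; the function name and error handling are illustrative, not the project's actual loader:

```python
import json

# Top-level keys the documented input format requires.
REQUIRED_KEYS = {"documents", "persona", "job_to_be_done"}

def load_input(path):
    """Load challenge1b_input.json and check the expected top-level keys.

    Illustrative helper; the real pipeline's loader may differ.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"input file missing keys: {sorted(missing)}")
    return data
```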

Output Format

The system generates challenge1b_output.json with the following structure:

{
    "metadata": {
        "input_documents": ["document1.pdf", "document2.pdf"],
        "persona": "Travel Planner",
        "job_to_be_done": "Plan a trip for 4 days",
        "processing_timestamp": "2025-01-XX..."
    },
    "extracted_sections": [
        {
            "document": "document1.pdf",
            "section_title": "Section Title",
            "page_number": 1,
            "importance_rank": 1
        }
    ],
    "subsection_analysis": [
        {
            "document": "document1.pdf",
            "page_number": 1,
            "refined_text": "Extracted and refined content..."
        }
    ]
}

Performance Monitoring

Model Size Verification

python model_size_checker.py

This utility checks:

  • Current model cache sizes
  • Estimated model sizes
  • Compliance with 1GB limit
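A cache-size check like this can be approximated by walking the cache directories and summing file sizes. This is a sketch of the idea, not the actual model_size_checker.py implementation:

```python
import os

def dir_size_mb(path):
    """Sum file sizes under a directory tree, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total / (1024 * 1024)

def check_limit(paths, limit_mb=1024):
    """Return (total_mb, within_limit) for the given cache directories."""
    total = sum(dir_size_mb(p) for p in paths if os.path.isdir(p))
    return total, total <= limit_mb
```

In practice the directories passed in would be the model caches named under Configuration (e.g. the Hugging Face and sentence-transformers cache paths).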

Performance Metrics

The system automatically tracks:

  • Processing time
  • Memory usage
  • CPU usage
  • Model loading time
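Timing individual pipeline stages can be done with a small context manager like the sketch below; the real system's instrumentation may differ:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(metrics, key):
    """Record the wall-clock duration of a block into a metrics dict.

    Illustrative helper, not the project's actual instrumentation.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics[key] = time.perf_counter() - start

metrics = {}
with timed(metrics, "processing_time"):
    sum(range(100_000))  # stand-in for a real pipeline stage
```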

Supported Personas

The system is optimized for the following personas:

  1. Travel Planner

    • Keywords: hotel, restaurant, attraction, transport, booking
    • Focus: Actionable travel information
  2. Researcher

    • Keywords: methodology, results, conclusion, data, analysis
    • Focus: Key findings and research insights
  3. Student

    • Keywords: concept, definition, example, important, key point
    • Focus: Learning points and educational content
  4. Business Analyst

    • Keywords: revenue, profit, market, strategy, performance
    • Focus: Business insights and metrics
  5. HR Professional

    • Keywords: form, process, procedure, policy, compliance
    • Focus: Procedural information and workflows
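Persona-specific keyword weighting can be sketched as a simple overlap score between a persona's keyword set and the section text. The keyword sets below mirror the documented personas, but the scoring function itself is an illustrative assumption:

```python
# Keyword sets taken from the persona list above; the scoring function
# is an illustrative sketch, not the real rank_sections.py logic.
PERSONA_KEYWORDS = {
    "Travel Planner": {"hotel", "restaurant", "attraction", "transport",
                       "booking"},
    "Researcher": {"methodology", "results", "conclusion", "data",
                   "analysis"},
}

def persona_score(text, role):
    """Fraction of a persona's keywords that appear in the text."""
    keywords = PERSONA_KEYWORDS.get(role, set())
    if not keywords:
        return 0.0
    words = set(text.lower().split())
    return len(keywords & words) / len(keywords)
```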

Configuration

Environment Variables

  • PYTHONPATH: Application path
  • TRANSFORMERS_CACHE: Model cache directory
  • HF_HOME: Hugging Face cache directory
  • NLTK_DATA: NLTK data directory
  • SENTENCE_TRANSFORMERS_HOME: Sentence transformers cache

System Parameters

  • max_processing_time: 60 seconds
  • max_model_size_mb: 1024 MB (1GB)
  • top_k_sections: 5 (number of top sections to extract)
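These parameters could be gathered into a single config object, as in this sketch (the class and helper names are illustrative, not the project's actual configuration code):

```python
from dataclasses import dataclass

@dataclass
class SystemConfig:
    """Illustrative mirror of the documented system parameters."""
    max_processing_time: int = 60   # seconds
    max_model_size_mb: int = 1024   # 1GB
    top_k_sections: int = 5

def within_budget(cfg, elapsed_s, model_mb):
    """Check a run against the time and model-size constraints."""
    return (elapsed_s <= cfg.max_processing_time
            and model_mb <= cfg.max_model_size_mb)
```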

Troubleshooting

Common Issues

  1. Model Download Issues

    # Clear model cache
    rm -rf ~/.cache/huggingface ~/.cache/torch
  2. Memory Issues

    # Check available memory
    python -c "import psutil; print(psutil.virtual_memory())"
  3. PDF Processing Issues

    # Verify PDF files
    python -c "import fitz; doc = fitz.open('document.pdf'); print(len(doc))"

Logging

The system generates detailed logs in document_intelligence.log:

  • Processing steps
  • Error messages
  • Performance metrics
  • Model loading information

Quality Assurance

Validation

The system includes comprehensive validation:

  • Input data structure validation
  • Output format validation
  • Performance constraint checking
  • Model size verification

Testing

# Run model size check
python model_size_checker.py

# Test with sample data
python run.py --collection Collection_1_Travel_Planning

Performance Characteristics

Model Sizes

  • all-MiniLM-L6-v2: ~90MB
  • PyTorch Runtime: ~50-100MB
  • Other Dependencies: ~50-100MB
  • Total: ~190-290MB

Processing Times

  • Small collections (3-5 documents): 15-30 seconds
  • Medium collections (6-8 documents): 30-45 seconds
  • Large collections (9-10 documents): 45-60 seconds

Memory Usage

  • Peak memory: ~500-800MB
  • Average memory: ~300-500MB
  • Cache size: ~100-200MB

Contributing

Development Setup

  1. Install development dependencies

    pip install -r requirements.txt
    pip install pytest black flake8
  2. Run tests

    pytest tests/
  3. Code formatting

    black src/
    flake8 src/

License

This project was developed for the Round 1B Document Intelligence Challenge.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs in document_intelligence.log
  3. Run the model size checker for verification
  4. Ensure all dependencies are properly installed
