This is an enhanced, lightweight document intelligence system designed for Round 1B of the Persona-Driven Document Intelligence challenge. The system extracts and prioritizes the most relevant sections from document collections based on specific personas and their job-to-be-done requirements.
- Lightweight Design: Total model size ~150-250MB (well under 1GB limit)
- Fast Processing: Optimized for 60-second processing time
- Persona-Aware: Specialized processing for different user types
- High Quality: Sophisticated algorithms despite lightweight approach
- Robust Error Handling: Comprehensive fallback mechanisms
- CPU-Only: No GPU dependencies
Core components:

- Enhanced Text Extraction (`src/extract_text.py`)
  - Robust PDF parsing with intelligent section detection
  - Font-based heading identification (see the sketch after this list)
  - Hierarchical structure recognition
- Quality-Optimized Embedding (`src/embed_text.py`)
  - Uses all-MiniLM-L6-v2 (~90 MB), upgraded for better quality
  - Efficient caching system
  - Text chunking for better representation
- Enhanced Ranking (`src/rank_sections.py`)
  - Multi-criteria scoring (semantic, persona, actionability, quality)
  - Persona-specific keyword weighting
  - Diversity optimization
- Lightweight Refinement (`src/refine_subsections.py`)
  - NLTK-based text processing
  - Persona-specific extraction strategies
  - Rule-based summarization
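As a rough illustration of the font-based heading identification mentioned above, the sketch below flags text spans whose font size clearly exceeds a page's typical body size, using PyMuPDF (`fitz`, which the project already relies on). The function name, the median baseline, and the `min_size_ratio` threshold are illustrative assumptions, not the actual logic in `src/extract_text.py`:

```python
import fitz  # PyMuPDF

def candidate_headings(pdf_path, min_size_ratio=1.2):
    """Flag spans whose font size clearly exceeds the page's body-text size."""
    doc = fitz.open(pdf_path)
    headings = []
    for page_num, page in enumerate(doc, start=1):
        spans = [
            span
            for block in page.get_text("dict")["blocks"]
            for line in block.get("lines", [])  # image blocks have no "lines"
            for span in line["spans"]
            if span["text"].strip()
        ]
        if not spans:
            continue
        # Use the median span size as a proxy for the body-text size.
        body_size = sorted(s["size"] for s in spans)[len(spans) // 2]
        for span in spans:
            if span["size"] >= body_size * min_size_ratio:
                headings.append((page_num, span["text"].strip(), span["size"]))
    return headings
```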
Prerequisites:

- Python 3.10 or higher
- Docker (optional, for containerized execution)
To install:

- Clone the repository

  ```bash
  git clone <repository-url>
  cd adolfee_1b
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Verify installation

  ```bash
  python model_size_checker.py
  ```
To build and verify with Docker:

- Build the Docker image

  ```bash
  docker build -t document-intelligence .
  ```

- Verify the container

  ```bash
  docker run --rm document-intelligence python model_size_checker.py
  ```
The system supports processing different document collections:
```bash
# Process Travel Planning collection (default)
python run.py --collection Collection_1_Travel_Planning

# Process Adobe Acrobat collection
python run.py --collection Collection_2_Adobe_Acrobat

# Clean up temporary files after processing
python run.py --collection Collection_1_Travel_Planning --cleanup

# Process with Docker
docker run -v $(pwd):/app document-intelligence python run.py --collection Collection_1_Travel_Planning

# Interactive mode
docker run -it -v $(pwd):/app document-intelligence bash
```

The system expects input files in the following structure:
```
Collection_X/
├── challenge1b_input.json
└── PDFs/
    ├── document1.pdf
    ├── document2.pdf
    └── ...
```

Example `challenge1b_input.json`:
```json
{
  "documents": [
    {"filename": "document1.pdf", "title": "Document 1"},
    {"filename": "document2.pdf", "title": "Document 2"}
  ],
  "persona": {"role": "Travel Planner"},
  "job_to_be_done": {"task": "Plan a trip for 4 days"}
}
```

The system generates `challenge1b_output.json` with the following structure:
```json
{
  "metadata": {
    "input_documents": ["document1.pdf", "document2.pdf"],
    "persona": "Travel Planner",
    "job_to_be_done": "Plan a trip for 4 days",
    "processing_timestamp": "2025-01-XX..."
  },
  "extracted_sections": [
    {
      "document": "document1.pdf",
      "section_title": "Section Title",
      "page_number": 1,
      "importance_rank": 1
    }
  ],
  "subsection_analysis": [
    {
      "document": "document1.pdf",
      "page_number": 1,
      "refined_text": "Extracted and refined content..."
    }
  ]
}
```

To check model sizes, run `python model_size_checker.py`. This utility checks the following (a sketch appears after the list):
- Current model cache sizes
- Estimated model sizes
- Compliance with 1GB limit
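A cache-size check of this kind can be implemented by walking the cache directories and summing file sizes. The sketch below is a minimal version, not the actual contents of `model_size_checker.py`:

```python
import os

LIMIT_MB = 1024  # the challenge's 1 GB model-size limit

def dir_size_mb(path):
    """Sum the sizes of all files under a directory tree, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total / (1024 * 1024)

for cache in ("~/.cache/huggingface", "~/.cache/torch"):
    size = dir_size_mb(os.path.expanduser(cache))
    print(f"{cache}: {size:.1f} MB (within {LIMIT_MB} MB limit: {size < LIMIT_MB})")
```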
The system automatically tracks the following (see the monitoring sketch after this list):
- Processing time
- Memory usage
- CPU usage
- Model loading time
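Tracking of this kind is straightforward with `psutil`, which is already a project dependency (see the troubleshooting section). A minimal sketch, assuming the pipeline runs in-process:

```python
import time
import psutil

proc = psutil.Process()  # the current Python process
start = time.perf_counter()

# ... run the document-intelligence pipeline here ...

elapsed = time.perf_counter() - start
rss_mb = proc.memory_info().rss / (1024 * 1024)  # resident memory
cpu_pct = proc.cpu_percent(interval=0.1)         # sampled CPU usage
print(f"time: {elapsed:.1f}s  memory: {rss_mb:.0f} MB  cpu: {cpu_pct:.0f}%")
```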
The system is optimized for the following personas; a scoring sketch follows the list:

- Travel Planner
  - Keywords: hotel, restaurant, attraction, transport, booking
  - Focus: Actionable travel information
- Researcher
  - Keywords: methodology, results, conclusion, data, analysis
  - Focus: Key findings and research insights
- Student
  - Keywords: concept, definition, example, important, key point
  - Focus: Learning points and educational content
- Business Analyst
  - Keywords: revenue, profit, market, strategy, performance
  - Focus: Business insights and metrics
- HR Professional
  - Keywords: form, process, procedure, policy, compliance
  - Focus: Procedural information and workflows
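To make the persona keyword weighting concrete, the sketch below blends all-MiniLM-L6-v2 semantic similarity with a persona keyword bonus. The 70/30 blend, the helper names, and the single persona entry are illustrative assumptions; only the model and the keyword lists come from this document:

```python
from sentence_transformers import SentenceTransformer, util

# Keyword lists per persona (one shown; the rest follow the list above).
PERSONA_KEYWORDS = {
    "Travel Planner": ["hotel", "restaurant", "attraction", "transport", "booking"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_section(section_text, query, persona, kw_weight=0.3):
    """Blend semantic similarity with a persona-specific keyword bonus."""
    semantic = util.cos_sim(
        model.encode(query, convert_to_tensor=True),
        model.encode(section_text, convert_to_tensor=True),
    ).item()
    keywords = PERSONA_KEYWORDS.get(persona, [])
    text = section_text.lower()
    kw_score = sum(kw in text for kw in keywords) / max(len(keywords), 1)
    return (1 - kw_weight) * semantic + kw_weight * kw_score
```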
Environment variables:

- `PYTHONPATH`: Application path
- `TRANSFORMERS_CACHE`: Model cache directory
- `HF_HOME`: Hugging Face cache directory
- `NLTK_DATA`: NLTK data directory
- `SENTENCE_TRANSFORMERS_HOME`: Sentence transformers cache
Configuration limits (see the sketch below):

- `max_processing_time`: 60 seconds
- `max_model_size_mb`: 1024 MB (1 GB)
- `top_k_sections`: 5 (number of top sections to extract)
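Expressed as a plain dictionary (the key names mirror the settings above; how the project actually stores them is an assumption):

```python
# Defaults mirroring the constraints listed above; names are illustrative.
CONFIG = {
    "max_processing_time": 60,   # seconds
    "max_model_size_mb": 1024,   # 1 GB
    "top_k_sections": 5,         # number of top sections to extract
}
```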
Common issues:

- Model Download Issues

  ```bash
  # Clear model cache
  rm -rf ~/.cache/huggingface ~/.cache/torch
  ```

- Memory Issues

  ```bash
  # Check available memory
  python -c "import psutil; print(psutil.virtual_memory())"
  ```

- PDF Processing Issues

  ```bash
  # Verify PDF files
  python -c "import fitz; doc = fitz.open('document.pdf'); print(len(doc))"
  ```
The system generates detailed logs in `document_intelligence.log` (a setup sketch follows this list):
- Processing steps
- Error messages
- Performance metrics
- Model loading information
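A logging setup along these lines would produce such a file; this is a minimal sketch, and the project's actual format may differ:

```python
import logging

logging.basicConfig(
    filename="document_intelligence.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).info("pipeline started")
```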
The system includes comprehensive validation (an input-check sketch follows this list):
- Input data structure validation
- Output format validation
- Performance constraint checking
- Model size verification
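An input check against the `challenge1b_input.json` structure shown earlier could look like the sketch below; the function and error messages are illustrative, not the project's actual validators:

```python
import json

REQUIRED_KEYS = {"documents", "persona", "job_to_be_done"}

def validate_input(path):
    """Check that challenge1b_input.json has the expected top-level structure."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for doc in data["documents"]:
        if "filename" not in doc:
            raise ValueError("each document entry needs a 'filename'")
    return data
```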
```bash
# Run model size check
python model_size_checker.py

# Test with sample data
python run.py --collection Collection_1_Travel_Planning
```

Approximate model sizes:

- all-MiniLM-L6-v2: ~90 MB (upgraded for better quality)
- PyTorch Runtime: ~50-100 MB
- Other Dependencies: ~50-100 MB
- Total: ~180-280 MB
Typical processing times:

- Small collections (3-5 documents): 15-30 seconds
- Medium collections (6-8 documents): 30-45 seconds
- Large collections (9-10 documents): 45-60 seconds
Memory usage:

- Peak memory: ~500-800 MB
- Average memory: ~300-500 MB
- Cache size: ~100-200 MB
To set up for development:

- Install development dependencies

  ```bash
  pip install -r requirements.txt
  pip install pytest black flake8
  ```

- Run tests

  ```bash
  pytest tests/
  ```

- Code formatting

  ```bash
  black src/
  flake8 src/
  ```
This project is developed for the Round 1B Document Intelligence Challenge.
For issues and questions:
- Check the troubleshooting section
- Review the logs in `document_intelligence.log`
- Run the model size checker for verification
- Ensure all dependencies are properly installed