A robust Python application for extracting text from PDF files using multiple extraction methods including direct text extraction and OCR (Optical Character Recognition) for image-based PDFs.
- Multiple extraction methods: PyPDF2, pdfplumber, and OCR with Tesseract
- Automatic fallback: If one method fails, automatically tries the next
- OCR support: Handles image-based PDFs and scanned documents
- Image extraction: Option to save images extracted from PDFs
- Batch processing: Convert multiple PDFs at once
- Configurable: Easy-to-modify configuration settings
- Logging: Comprehensive logging for debugging and monitoring
- Cross-platform: Works on Windows, macOS, and Linux
- Python 3.7+
- Tesseract OCR (for OCR functionality)
- Download Tesseract from: https://github.com/UB-Mannheim/tesseract/wiki
- Install to the default location: C:\Program Files\Tesseract-OCR\
- Update TESSERACT_PATH in config.py if installed elsewhere
brew install tesseract
sudo apt-get update
sudo apt-get install tesseract-ocr
- Clone or download this repository
- Install Python dependencies:
pip install -r requirements.txt
- Verify Tesseract installation:
tesseract --version
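The same check can be done from Python before any OCR work starts. The following is a minimal sketch, not the project's own code: the helper names resolve_tesseract and tesseract_version are hypothetical, and it assumes the configured path should win over whatever is on PATH.

```python
import os
import shutil
import subprocess


def resolve_tesseract(configured_path=None, which=shutil.which):
    """Return a usable Tesseract executable path, or None if none is found."""
    # Prefer the explicitly configured path when it points at a real file.
    if configured_path and os.path.isfile(configured_path):
        return configured_path
    # Otherwise fall back to whatever "tesseract" resolves to on PATH.
    return which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract builds print the version banner to stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = resolve_tesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
    if path is None:
        print("Tesseract not found; install it or update TESSERACT_PATH")
    else:
        print(tesseract_version(path))
```

The which parameter exists only so the lookup can be replaced in tests; normal callers can leave it at its default.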
python main.py document.pdf
python main.py document.pdf -o output.txt
python main.py -d /path/to/pdf/folder/
python main.py document.pdf -v
python main.py document.pdf --save-images
python main.py document.pdf --no-ocr
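For readers who want to see how these options fit together, here is a sketch of an argparse parser that accepts the commands above. It is an illustration only: the long option names --output, --directory, and --verbose are assumptions (the commands above show only -o, -d, and -v), and the actual main.py may be structured differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract text from PDF files.",
    )
    # Optional positional so that "-d folder/" works without a single file.
    parser.add_argument("pdf", nargs="?", help="path to a single PDF file")
    # The long option names below are assumed for this sketch.
    parser.add_argument("-o", "--output", help="output text file")
    parser.add_argument("-d", "--directory", help="folder of PDFs to convert")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable verbose logging")
    parser.add_argument("--save-images", action="store_true",
                        help="save images extracted from the PDF")
    parser.add_argument("--no-ocr", action="store_true",
                        help="disable the OCR fallback")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Require at least one input source.
    if not args.pdf and not args.directory:
        parser.error("provide a PDF file or -d DIRECTORY")
    return args
```

Calling parse_args(["document.pdf", "-o", "output.txt"]) yields a namespace with pdf and output set and both boolean flags left at False.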
from utils.text_extractor import TextExtractor
# Initialize extractor
extractor = TextExtractor()
# Extract text from PDF
text = extractor.extract_text('document.pdf')
# Get PDF information
pdf_info = extractor.get_pdf_info('document.pdf')
print(f"Pages: {pdf_info['num_pages']}")
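Batch conversion can also be driven from Python. The sketch below is a hypothetical helper, convert_folder, that is not part of the project; it assumes the extraction callable takes a file path and returns the text as a string, which is how extract_text is used above.

```python
from pathlib import Path


def convert_folder(folder, extract_text, output_dir="output"):
    """Write one .txt file per PDF in `folder`; return the written paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() keeps the processing order stable across platforms.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = extract_text(str(pdf_path))
        target = out_dir / (pdf_path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the extractor from the example above, a call might look like convert_folder("./documents", extractor.extract_text).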
Edit config.py to customize the application:
# OCR Configuration
TESSERACT_PATH = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
OCR_LANGUAGE = 'eng' # Change for other languages
# Output Configuration
OUTPUT_DIR = 'output'
SAVE_IMAGES = False
# Processing Configuration
DPI = 300 # Higher DPI = better quality, slower processing
USE_OCR_FALLBACK = True
pdf_to_txt_converter/
├── main.py # Main application entry point
├── requirements.txt # Python dependencies
├── config.py # Configuration settings
├── utils/
│ ├── __init__.py
│ ├── text_extractor.py # Core text extraction logic
│ ├── ocr_handler.py # OCR processing
│ └── image_saver.py # Image saving utilities
└── README.md # This file
- Text-based PDFs: First tries pdfplumber (most accurate), then PyPDF2
- Image-based PDFs: If text extraction yields poor results, automatically falls back to OCR
- OCR Process: Converts PDF pages to images, preprocesses them, and uses Tesseract to extract text
- Output: Saves extracted text with metadata including page information
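The fallback order described above can be sketched as follows. This is an approximation under stated assumptions, not the code in text_extractor.py: the function names are hypothetical, the "poor results" test is a simple character-count threshold chosen for illustration, and the OCR step uses a plain Pillow grayscale conversion in place of the project's OpenCV preprocessing, which is not shown here. The third-party imports are placed inside the functions so the module loads even when a library is missing.

```python
def extract_with_pdfplumber(path):
    import pdfplumber  # imported lazily so the sketch loads without it

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf2(path):
    from PyPDF2 import PdfReader  # available in PyPDF2 2.x and later

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_ocr(path, dpi=300, language="eng"):
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(path, dpi=dpi)  # one PIL image per page
    texts = []
    for image in pages:
        gray = image.convert("L")  # simple grayscale preprocessing
        texts.append(pytesseract.image_to_string(gray, lang=language))
    return "\n".join(texts)


def looks_usable(text, min_chars=50):
    """Assumed quality check: enough non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def extract_with_fallback(path, extractors, min_chars=50):
    """Try each (name, function) pair in order; return (name, text)."""
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            continue  # this method failed, move on to the next one
        if looks_usable(text, min_chars):
            return name, text
    return None, ""


DEFAULT_EXTRACTORS = [
    ("pdfplumber", extract_with_pdfplumber),
    ("PyPDF2", extract_with_pypdf2),
    ("ocr", extract_with_ocr),
]
```

Passing the extractors as a list keeps the ordering explicit and makes it easy to drop the OCR entry, which is roughly what a --no-ocr option would need.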
By default, the application uses English (eng). To use other languages:
- Install additional Tesseract language packs
- Update OCR_LANGUAGE in config.py
Common language codes:
- eng: English
- fra: French
- deu: German
- spa: Spanish
- chi_sim: Chinese Simplified
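Before changing OCR_LANGUAGE, it can help to confirm that the language pack is actually installed. The sketch below parses the output of tesseract --list-langs; the helper names are hypothetical, and it assumes the listing starts with a "List of available languages" header line followed by one code per line.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, which starts with "List of".
    return [line for line in lines if not line.startswith("List of")]


def installed_languages(executable="tesseract"):
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions write this listing to stderr.
    return parse_list_langs(result.stdout or result.stderr)


def language_available(code, available):
    """Check a code, including combined codes such as 'eng+fra'."""
    return all(part in available for part in code.split("+"))
```

Tesseract accepts several languages joined with a plus sign, so language_available splits on "+" and requires every part to be present.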
- "Tesseract not found":
  - Verify Tesseract installation
  - Check TESSERACT_PATH in config.py
- Poor OCR results:
  - Increase DPI in config.py
  - Try different OCR_CONFIG settings
- Out of memory errors:
  - Reduce DPI setting
  - Process smaller PDFs
- No text extracted:
  - PDF might be password protected
  - PDF might be corrupted
  - Try with --save-images to check image quality
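The DPI advice above pulls in two directions: raising DPI helps OCR quality, while lowering it reduces memory use. A rough estimate shows why. The sketch below assumes an uncompressed RGB image with one byte per channel and a US Letter page (8.5 x 11 inches); actual memory use depends on the libraries involved and will be higher once preprocessing copies are made.

```python
def page_image_bytes(width_in, height_in, dpi, channels=3):
    """Bytes needed for one uncompressed page image (1 byte per channel)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


for dpi in (150, 300, 600):
    size = page_image_bytes(8.5, 11, dpi)
    print(f"{dpi} DPI: {size / 1024 ** 2:.1f} MiB per page")
```

At 300 DPI a Letter page is 2550 x 3300 pixels, about 24 MiB under these assumptions. Doubling to 600 DPI quadruples that to about 96 MiB per page, which adds up quickly when many pages are held in memory at once.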
Check pdf_converter.log for detailed error information and processing logs.
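A logging setup along these lines would produce such a file. This is a sketch using the standard logging module, not the project's actual configuration: the logger name, the message format, and the mapping of verbose mode to the DEBUG level are all assumptions.

```python
import logging


def configure_logging(log_file="pdf_converter.log", verbose=False):
    logger = logging.getLogger("pdf_converter")  # assumed logger name
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    # Drop handlers from earlier calls so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    # Write everything to the log file.
    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    # Mirror the same messages on the console.
    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)
    return logger
```

With verbose=False, debug messages are filtered out, so the log file stays short during normal runs.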
python main.py invoice.pdf
# Output: output/invoice.txt
python main.py -d ./documents/ --save-images -v
# Converts all PDFs in documents/ folder
# Saves images in images/ folder
# Enables verbose logging
# Modify config.py for specific needs
DPI = 600 # Higher quality for fine text
OCR_LANGUAGE = 'fra' # French language
SAVE_IMAGES = True # Always save images
- PyPDF2: Basic PDF text extraction
- pdfplumber: Advanced PDF text extraction
- pytesseract: OCR engine interface
- Pillow: Image processing
- pdf2image: PDF to image conversion
- opencv-python: Image preprocessing
- numpy: Numerical operations
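A quick way to confirm that these packages are installed is to look up their import modules. The sketch below is a hypothetical helper, not part of the project; note that Pillow is imported as PIL and opencv-python as cv2.

```python
import importlib.util

# Package name on PyPI -> module name used in import statements.
REQUIRED_MODULES = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
    "pdf2image": "pdf2image",
    "opencv-python": "cv2",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found")
```

This only checks that the modules can be located; it does not verify versions or that the Tesseract binary itself is installed.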
This project is open source and available under the MIT License.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Check the troubleshooting section
- Review the logs in pdf_converter.log
- Open an issue with details about your PDF and error messages