PDF to Text Converter

A Python application that extracts text from PDF files. It combines direct text extraction with OCR (Optical Character Recognition) for image-based PDFs.

Features

Multiple extraction methods: pdfplumber, PyPDF2, and OCR with Tesseract
Automatic fallback: If one method fails or yields poor results, the next one is tried automatically
OCR support: Handles image-based PDFs and scanned documents
Image extraction: Option to save images extracted from PDFs
Batch processing: Convert multiple PDFs at once
Configurable: All settings live in a single config.py file
Logging: Comprehensive logging for debugging and monitoring
Cross-platform: Works on Windows, macOS, and Linux

Prerequisites

System Requirements

Python 3.7+
Tesseract OCR (for OCR functionality)

Installing Tesseract

Windows

Download the Tesseract installer from https://github.com/UB-Mannheim/tesseract/wiki
Install it to the default location: C:\Program Files\Tesseract-OCR\
If you installed it elsewhere, update TESSERACT_PATH in config.py to match

macOS

brew install tesseract

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr

Installation

Clone or download this repository
Install Python dependencies:
```
pip install -r requirements.txt
```
Verify Tesseract installation:
```
tesseract --version
```

Usage

Command Line Interface

Convert a single PDF:

python main.py document.pdf

Convert with custom output path:

python main.py document.pdf -o output.txt

Convert all PDFs in a directory:

python main.py -d /path/to/pdf/folder/

Enable verbose logging:

python main.py document.pdf -v

Save extracted images:

python main.py document.pdf --save-images

Disable OCR fallback:

python main.py document.pdf --no-ocr
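
The flags shown above could be wired up with `argparse` roughly as follows. This is a minimal sketch, not the project's actual `main.py`: the function name `build_parser`, the long option names `--output`, `--directory`, and `--verbose`, and the attribute names are assumptions. Only the short flags, `--save-images`, and `--no-ocr` come from the examples above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags in the examples (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Extract text from PDF files.")
    # The single-file argument is optional because -d can be used instead.
    parser.add_argument("pdf", nargs="?", help="path to a single PDF file")
    parser.add_argument("-o", "--output", help="custom output text file")
    parser.add_argument("-d", "--directory", help="convert every PDF in this folder")
    parser.add_argument("-v", "--verbose", action="store_true", help="verbose logging")
    parser.add_argument("--save-images", action="store_true", help="save extracted images")
    parser.add_argument("--no-ocr", action="store_true", help="disable the OCR fallback")
    return parser


args = build_parser().parse_args(["document.pdf", "-o", "output.txt", "--no-ocr"])
print(args.pdf, args.output, args.no_ocr)  # document.pdf output.txt True
```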

Programmatic Usage

from utils.text_extractor import TextExtractor

# Initialize extractor
extractor = TextExtractor()

# Extract text from PDF
text = extractor.extract_text('document.pdf')

# Get PDF information
pdf_info = extractor.get_pdf_info('document.pdf')
print(f"Pages: {pdf_info['num_pages']}")

Configuration

Edit config.py to customize the application:

# OCR Configuration
TESSERACT_PATH = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
OCR_LANGUAGE = 'eng'  # Change for other languages

# Output Configuration
OUTPUT_DIR = 'output'
SAVE_IMAGES = False

# Processing Configuration
DPI = 300  # Higher DPI = better quality, slower processing
USE_OCR_FALLBACK = True
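
Because the default `TESSERACT_PATH` is a Windows location, one option on macOS and Linux is to fall back to whatever `tesseract` binary is on the system `PATH`. The helper below is a sketch under that assumption; `resolve_tesseract_path` is a hypothetical name and is not part of this project.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_tesseract_path(configured: str) -> Optional[str]:
    """Return the configured binary if it exists, else search PATH (hypothetical helper)."""
    if configured and Path(configured).is_file():
        return configured
    # shutil.which returns None when no executable named "tesseract" is on PATH.
    return shutil.which("tesseract")


path = resolve_tesseract_path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
print(path or "Tesseract not found")
```

The resolved path could then be assigned to `pytesseract.pytesseract.tesseract_cmd`, which is the attribute pytesseract reads to locate the binary. Whether this project sets it that way is an assumption.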

Project Structure

pdf_to_txt_converter/
├── main.py              # Main application entry point
├── requirements.txt     # Python dependencies
├── config.py            # Configuration settings
├── utils/
│   ├── __init__.py
│   ├── text_extractor.py  # Core text extraction logic
│   ├── ocr_handler.py     # OCR processing
│   └── image_saver.py     # Image saving utilities
└── README.md            # This file

How It Works

Text-based PDFs: pdfplumber is tried first (usually the more accurate of the two), then PyPDF2
Image-based PDFs: If text extraction yields poor results, automatically falls back to OCR
OCR Process: Converts PDF pages to images, preprocesses them, and uses Tesseract to extract text
Output: Saves extracted text with metadata including page information
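
The fallback order described above can be sketched as a loop over extraction callables. This is an illustration of the control flow only: `extract_with_fallback`, `looks_usable`, and the 50-character threshold are assumptions, and the stand-in functions replace the real pdfplumber, PyPDF2, and OCR calls. How the project actually decides that a result is "poor" is not stated here.

```python
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], str]


def looks_usable(text: str, min_chars: int = 50) -> bool:
    """Assumed heuristic: enough non-whitespace characters counts as usable."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def extract_with_fallback(
    pdf_path: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str]:
    """Try each (name, extractor) in order; return the first usable (name, text)."""
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            continue  # this method failed, so move on to the next one
        if looks_usable(text):
            return name, text
    return None, ""


# Stand-ins for the real backends, used only to show the control flow.
def fake_pdfplumber(path: str) -> str:
    raise ValueError("simulated parse failure")


def fake_pypdf2(path: str) -> str:
    return "   "  # simulates a scanned PDF with no embedded text


def fake_ocr(path: str) -> str:
    return "Recognized text from the scanned page. " * 3


method, text = extract_with_fallback(
    "document.pdf",
    [("pdfplumber", fake_pdfplumber), ("PyPDF2", fake_pypdf2), ("ocr", fake_ocr)],
)
print(method)  # ocr
```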

Supported Languages for OCR

By default, the application uses English (eng). To use other languages:

Install the matching Tesseract language pack (for example, the tesseract-ocr-fra package on Ubuntu/Debian)
Update OCR_LANGUAGE in config.py

Common language codes:

eng - English
fra - French
deu - German
spa - Spanish
chi_sim - Chinese Simplified

Troubleshooting

Common Issues

"Tesseract not found":
- Verify Tesseract installation
- Check TESSERACT_PATH in config.py
Poor OCR results:
- Increase DPI in config.py
- Try different OCR_CONFIG settings
Out of memory errors:
- Reduce DPI setting
- Process smaller PDFs
No text extracted:
- The PDF might be password-protected
- The PDF might be corrupted
- Try with --save-images to check image quality
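
To see why the DPI setting drives memory use, it helps to estimate the uncompressed size of one rendered page. The sketch below assumes an A4 page (8.27 x 11.69 inches) and 3 bytes per pixel (RGB). Real usage depends on the rendering library and on any copies made during preprocessing, so treat the numbers as a lower bound.

```python
def page_bytes(dpi: int, width_in: float = 8.27, height_in: float = 11.69,
               bytes_per_pixel: int = 3) -> int:
    """Approximate uncompressed size of one rendered page (assumes A4, RGB)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (150, 300, 600):
    print(f"{dpi} DPI: about {page_bytes(dpi) / 1e6:.0f} MB per page")
```

Doubling the DPI roughly quadruples the pixel count, which is why dropping from 600 to 300 has such a large effect on memory.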

Logging

Check pdf_converter.log for detailed error information and processing logs.
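
For reference, a log file like this can be produced with the standard `logging` module. The sketch below is one possible setup, not the project's actual code: `setup_logging`, the logger name `pdf_converter`, and the mapping of verbose mode to the DEBUG level are assumptions.

```python
import logging


def setup_logging(log_file: str = "pdf_converter.log", verbose: bool = False) -> logging.Logger:
    """Log to a file and the console; verbose selects DEBUG (assumed behavior)."""
    logger = logging.getLogger("pdf_converter")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_file, encoding="utf-8"), logging.StreamHandler()):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log = setup_logging(verbose=True)
log.debug("Falling back to OCR for page %d", 3)
```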

Examples

Basic Conversion

python main.py invoice.pdf
# Output: output/invoice.txt
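
The mapping from input file to default output file can be sketched as below. `default_output_path` is a hypothetical helper; the only facts taken from this README are the `OUTPUT_DIR = 'output'` default and the `invoice.pdf` to `output/invoice.txt` example.

```python
from pathlib import Path

OUTPUT_DIR = "output"  # same default as in config.py


def default_output_path(pdf_path: str, output_dir: str = OUTPUT_DIR) -> Path:
    """Map a PDF path to <output_dir>/<pdf stem>.txt (hypothetical helper)."""
    return Path(output_dir) / (Path(pdf_path).stem + ".txt")


print(default_output_path("invoice.pdf").as_posix())  # output/invoice.txt
```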

Batch Processing with Images

python main.py -d ./documents/ --save-images -v
# Converts all PDFs in documents/ folder
# Saves images in images/ folder
# Enables verbose logging

Custom Configuration

# Modify config.py for specific needs
DPI = 600  # Higher quality for fine text
OCR_LANGUAGE = 'fra'  # French language
SAVE_IMAGES = True  # Always save images

Dependencies

PyPDF2: Basic PDF text extraction
pdfplumber: Advanced PDF text extraction
pytesseract: OCR engine interface
Pillow: Image processing
pdf2image: PDF to image conversion
opencv-python: Image preprocessing
numpy: Numerical operations

License

This project is open source and available under the MIT License.

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

Support

For issues and questions:

Check the troubleshooting section
Review the logs in pdf_converter.log
Open an issue with details about your PDF and error messages

