MNakhaeiR/PDF-to-Text

PDF to Text Converter

A Python application that extracts text from PDF files. It combines direct text extraction with OCR (Optical Character Recognition) for image-based PDFs.

Features

  • Multiple extraction methods: PyPDF2, pdfplumber, and OCR with Tesseract
  • Automatic fallback: If one method fails, automatically tries the next
  • OCR support: Handles image-based PDFs and scanned documents
  • Image extraction: Option to save images extracted from PDFs
  • Batch processing: Convert multiple PDFs at once
  • Configurable: Easy-to-modify configuration settings
  • Logging: Comprehensive logging for debugging and monitoring
  • Cross-platform: Works on Windows, macOS, and Linux

Prerequisites

System Requirements

  1. Python 3.7+
  2. Tesseract OCR (for OCR functionality)

Installing Tesseract

Windows

  1. Download Tesseract from: https://github.com/UB-Mannheim/tesseract/wiki
  2. Install to the default location: C:\Program Files\Tesseract-OCR\
  3. Update the TESSERACT_PATH in config.py if installed elsewhere

macOS

brew install tesseract

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr

Installation

  1. Clone or download this repository

  2. Install Python dependencies:

    pip install -r requirements.txt

  3. Verify Tesseract installation:

    tesseract --version
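The same check can be done from Python before any OCR work starts. The following is a minimal sketch using only the standard library; `find_tesseract` and `tesseract_version` are hypothetical helper names and are not part of this repository.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(configured_path: Optional[str] = None) -> Optional[str]:
    """Return a usable Tesseract executable path, or None if none is found."""
    # shutil.which accepts a bare command name or a full path and returns
    # None when the target does not exist or is not executable.
    for candidate in (configured_path, "tesseract"):
        if candidate:
            resolved = shutil.which(candidate)
            if resolved:
                return resolved
    return None


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract builds write the version banner to stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""
```

Calling `find_tesseract(TESSERACT_PATH)` with the value from config.py tries the configured location first and then falls back to whatever is on PATH.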

Usage

Command Line Interface

Convert a single PDF:

python main.py document.pdf

Convert with custom output path:

python main.py document.pdf -o output.txt

Convert all PDFs in a directory:

python main.py -d /path/to/pdf/folder/

Enable verbose logging:

python main.py document.pdf -v

Save extracted images:

python main.py document.pdf --save-images

Disable OCR fallback:

python main.py document.pdf --no-ocr
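
For readers who want to see how these options fit together, here is a minimal argparse sketch of a parser that accepts the commands shown above. It is an illustration, not the code in main.py: the long names `--output`, `--directory`, and `--verbose` and the helper names are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Convert PDF files to text.")
    # The positional PDF path is optional because -d can be used instead.
    parser.add_argument("pdf", nargs="?", help="path to a single PDF file")
    parser.add_argument("-o", "--output", help="output text file path")
    parser.add_argument("-d", "--directory", help="convert every PDF in this directory")
    parser.add_argument("-v", "--verbose", action="store_true", help="enable verbose logging")
    parser.add_argument("--save-images", action="store_true", help="save images extracted from the PDF")
    parser.add_argument("--no-ocr", action="store_true", help="disable the OCR fallback")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv when None) and require a file or a directory."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.pdf and not args.directory:
        parser.error("provide a PDF file or use -d with a directory")
    return args
```

argparse converts the dashes in `--save-images` and `--no-ocr` to underscores, so the parsed values are read as `args.save_images` and `args.no_ocr`.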

Programmatic Usage

from utils.text_extractor import TextExtractor

# Initialize extractor
extractor = TextExtractor()

# Extract text from PDF
text = extractor.extract_text('document.pdf')

# Get PDF information
pdf_info = extractor.get_pdf_info('document.pdf')
print(f"Pages: {pdf_info['num_pages']}")
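
The same API can drive a batch conversion from your own script. The sketch below is hedged: `convert_directory` is a hypothetical helper, and it takes the extraction function as a parameter so that it makes no assumptions about `TextExtractor` beyond the `extract_text(path)` call shown above.

```python
from pathlib import Path
from typing import Callable, List


def convert_directory(pdf_dir: str, output_dir: str,
                      extract: Callable[[str], str]) -> List[Path]:
    """Run `extract` on every *.pdf in pdf_dir and write one .txt file per PDF."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() keeps the processing order stable across runs.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        text = extract(str(pdf_path))
        target = out / (pdf_path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the extractor from the example above, a call might look like `convert_directory('documents', 'output', extractor.extract_text)`. Note that the `*.pdf` pattern is case-sensitive on Linux, so files named `.PDF` would need a second pattern.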

Configuration

Edit config.py to customize the application:

# OCR Configuration
TESSERACT_PATH = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Windows default; change if installed elsewhere
OCR_LANGUAGE = 'eng'  # Change for other languages

# Output Configuration
OUTPUT_DIR = 'output'
SAVE_IMAGES = False

# Processing Configuration
DPI = 300  # Higher DPI = better quality, slower processing
USE_OCR_FALLBACK = True
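
To make the DPI trade-off concrete, the following sketch estimates how large a rendered page becomes at different settings. It assumes a US Letter page (8.5 x 11 inches) held in memory as uncompressed 8-bit RGB; the application's actual image mode and page sizes may differ, and the function names are illustrative.

```python
from typing import Tuple


def page_pixels(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Bytes needed for one uncompressed 8-bit RGB page image (3 bytes per pixel)."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * 3


for dpi in (150, 300, 600):
    w, h = page_pixels(8.5, 11.0, dpi)
    mib = rgb_bytes(8.5, 11.0, dpi) / (1024 * 1024)
    print(f"{dpi} DPI: {w} x {h} px, about {mib:.1f} MiB per page")
# 150 DPI: 1275 x 1650 px, about 6.0 MiB per page
# 300 DPI: 2550 x 3300 px, about 24.1 MiB per page
# 600 DPI: 5100 x 6600 px, about 96.3 MiB per page
```

Doubling the DPI quadruples the pixel count, which is why lowering DPI is a quick remedy when long documents exhaust memory.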

Project Structure

pdf_to_txt_converter/
├── main.py              # Main application entry point
├── requirements.txt     # Python dependencies
├── config.py            # Configuration settings
├── utils/
│   ├── __init__.py
│   ├── text_extractor.py  # Core text extraction logic
│   ├── ocr_handler.py     # OCR processing
│   └── image_saver.py     # Image saving utilities
└── README.md            # This file

How It Works

  1. Text-based PDFs: First tries pdfplumber (typically the more accurate of the two), then falls back to PyPDF2
  2. Image-based PDFs: If text extraction yields poor results, automatically falls back to OCR
  3. OCR Process: Converts PDF pages to images, preprocesses them, and uses Tesseract to extract text
  4. Output: Saves extracted text with metadata including page information
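
The fallback chain in steps 1 to 3 can be sketched as follows. This is an illustration rather than the repository's implementation: the extractors are passed in as plain callables, and the "poor results" test (`looks_usable`, with a 50-character threshold) is a hypothetical heuristic.

```python
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], str]


def looks_usable(text: Optional[str], min_chars: int = 50) -> bool:
    """Hypothetical quality check: enough non-whitespace characters."""
    if not text:
        return False
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def extract_with_fallback(pdf_path: str,
                          methods: Iterable[Tuple[str, Extractor]],
                          min_chars: int = 50) -> Tuple[Optional[str], str]:
    """Try each (name, extractor) in order and return (method_name, text).

    Returns (None, best_text_seen) when no method yields usable text.
    """
    best = ""
    for name, method in methods:
        try:
            text = method(pdf_path)
        except Exception:
            # A failing method must not stop the chain; move to the next one.
            continue
        if looks_usable(text, min_chars):
            return name, text
        if text and len(text) > len(best):
            best = text
    return None, best
```

The methods would be listed in the order described above, for example `[("pdfplumber", f1), ("pypdf2", f2), ("ocr", f3)]`, where each `f` is a function that takes a PDF path and returns its text. Leaving the OCR entry out of the list has the same effect as disabling the OCR fallback.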

Supported Languages for OCR

By default, the application uses English (eng). To use other languages:

  1. Install additional Tesseract language packs
  2. Update OCR_LANGUAGE in config.py

Common language codes:

  • eng - English
  • fra - French
  • deu - German
  • spa - Spanish
  • chi_sim - Chinese Simplified
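
Tesseract also accepts several languages at once, joined with a plus sign (for example `eng+fra`). The helper below is a small sketch for building such a value; it assumes OCR_LANGUAGE is passed to Tesseract unchanged, which this README does not confirm, and `build_ocr_language` is a hypothetical name.

```python
from typing import Iterable


def build_ocr_language(codes: Iterable[str]) -> str:
    """Join Tesseract language codes into a single value such as 'eng+fra'."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


OCR_LANGUAGE = build_ocr_language(["eng", "fra"])
print(OCR_LANGUAGE)  # eng+fra
```

Every code in the list needs its language pack installed, otherwise Tesseract reports an error when it loads the language data.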

Troubleshooting

Common Issues

  1. "Tesseract not found":

    • Verify Tesseract installation
    • Check TESSERACT_PATH in config.py

  2. Poor OCR results:

    • Increase DPI in config.py
    • Try different OCR_CONFIG settings

  3. Out of memory errors:

    • Reduce DPI setting
    • Process smaller PDFs

  4. No text extracted:

    • PDF might be password protected
    • PDF might be corrupted
    • Try with --save-images to check image quality

Logging

Check pdf_converter.log for detailed error information and processing logs.
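
A logging setup that writes to pdf_converter.log and honours the -v flag might look like the sketch below. The logger name, message format, and `setup_logging` function are assumptions for illustration; the repository's actual configuration may differ.

```python
import logging


def setup_logging(log_file: str = "pdf_converter.log",
                  verbose: bool = False) -> logging.Logger:
    """Log to a file and to the console; verbose switches INFO to DEBUG."""
    logger = logging.getLogger("pdf_converter")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    # Remove and close handlers from earlier calls to avoid duplicate lines.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    handlers = (
        logging.FileHandler(log_file, encoding="utf-8"),
        logging.StreamHandler(),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

With `verbose=True`, DEBUG messages such as per-page extraction details are written as well, which is usually what you want when diagnosing a failed conversion.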

Examples

Basic Conversion

python main.py invoice.pdf
# Output: output/invoice.txt

Batch Processing with Images

python main.py -d ./documents/ --save-images -v
# Converts all PDFs in documents/ folder
# Saves images in images/ folder
# Enables verbose logging

Custom Configuration

# Modify config.py for specific needs
DPI = 600  # Higher quality for fine text
OCR_LANGUAGE = 'fra'  # French language
SAVE_IMAGES = True  # Always save images

Dependencies

  • PyPDF2: Basic PDF text extraction
  • pdfplumber: Advanced PDF text extraction
  • pytesseract: OCR engine interface
  • Pillow: Image processing
  • pdf2image: PDF to image conversion
  • opencv-python: Image preprocessing
  • numpy: Numerical operations

License

This project is open source and available under the MIT License.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs in pdf_converter.log
  3. Open an issue with details about your PDF and error messages
