Offline Document Accessibility (PDF)

Document accessibility ensures equal access to information for everyone, regardless of their abilities. It not only helps foster a more inclusive society but is also an ethical responsibility to respect diverse audience needs. Besides, investing in document accessibility can benefit businesses by expanding their audience and demonstrating a commitment to ethical practices and social responsibility.

This Python CLI tool detects and helps address WCAG accessibility issues in offline documents, currently targeting PDF. It aims to give everyone a user-friendly experience and to let users access documents with assistive technology.

Installation

git clone https://github.com/caraaaaa/doc_accessibility.git
conda env create -f env/environment.yml
conda activate doc_accessibility_cli

Features

Searchable PDF Creator
Text Alternatives for Non-text Content
- Identify Non-text Content
- Generate Description for Non-text Content
Text Representation
- Text Contrast
- Line Spacing

Searchable PDF Creator

Convert a scanned PDF document to a searchable format using Tesseract OCR.

Criteria: User can read or extract the words using assistive technologies, or manipulate the PDF for accessibility.

WCAG guideline: 1.4.5

Scanned PDF	Searchable PDF

Usage Instruction

python script/scanned2searchable.py [-o OUTPUT_PDF_PATH] [-s] input_pdf_path

Default output path: readable_pdf.pdf

optional arguments:
  -o OUTPUT_PDF_PATH, --output_pdf_path OUTPUT_PDF_PATH
                        The path for the output searchable PDF file.
  -s, --show_result     Show text of the searchable PDF after OCR
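
The interface above could be declared with `argparse` roughly as follows. This is a minimal sketch reconstructed from the usage text, not the actual contents of `script/scanned2searchable.py`; the `build_parser` name is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed in the usage text (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Convert a scanned PDF to a searchable PDF."
    )
    parser.add_argument("input_pdf_path", help="Path to the scanned input PDF.")
    parser.add_argument(
        "-o", "--output_pdf_path",
        default="readable_pdf.pdf",  # default output path stated in the README
        help="The path for the output searchable PDF file.",
    )
    parser.add_argument(
        "-s", "--show_result",
        action="store_true",
        help="Show text of the searchable PDF after OCR.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["-s", "scan.pdf"])
print(args.input_pdf_path, args.output_pdf_path, args.show_result)
```

Keeping the default in the parser means the output path only has to be given when the default `readable_pdf.pdf` would overwrite something you want to keep.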

Text Alternatives for Non-text Content

Provide descriptions for images inside a searchable PDF using image classification, a Transformer model, and OCR.

Criteria: All non-text content (e.g. images, formulas) that is presented to the user has a text alternative that serves the equivalent purpose.

WCAG guideline: 1.1

Overview

flowchart LR
    A[Extract Images] -->|Optional| B[Save Images]
    A --> C[Check Alt Text]
    C --> D[Classification]
    D -->|image of text| F[Perform OCR]
    D -->|non-text image| G[Generate Image Caption]
    F --> H[Provide Alt Text Suggestions]
    G --> H
    H -->|Optional| I[Output PDF with Bounding Box]
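
The flow above can be read as a routing decision made once per extracted image. The sketch below is illustrative only: `ExtractedImage`, `suggest_alt_text`, and the three callables are hypothetical stand-ins for the classifier, OCR, and captioning steps, and it assumes that images which already carry alt text are skipped.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractedImage:
    """Hypothetical container for one image pulled from the PDF."""
    name: str
    data: bytes
    alt_text: Optional[str] = None


def suggest_alt_text(
    image: ExtractedImage,
    is_image_of_text: Callable[[bytes], bool],
    run_ocr: Callable[[bytes], str],
    caption: Callable[[bytes], str],
) -> Optional[str]:
    """Return an alt text suggestion, or None if the image already has alt text."""
    if image.alt_text:  # "Check Alt Text": nothing to suggest (assumption)
        return None
    if is_image_of_text(image.data):  # "Classification" -> image of text
        return run_ocr(image.data).strip()
    return caption(image.data).strip()  # "Classification" -> non-text image


# Stub callables stand in for the real models.
img = ExtractedImage(name="page1_img0", data=b"raw image bytes")
print(suggest_alt_text(img, lambda b: True, lambda b: " Invoice ", lambda b: "a logo"))
```

Passing the three steps in as callables keeps the routing logic testable without loading any model.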


Usage Instruction

python script/extract_PDF_image.py [--output_img] [--output_folder OUTPUT_FOLDER] [--draw_bbox] [--output_pdf_path OUTPUT_PDF_PATH] [--captioning] input_pdf_path

optional arguments:
  --output_img          Output images extracted from the PDF.
  --output_folder OUTPUT_FOLDER
                        The directory for the output images.
  --draw_bbox           Output PDF with bounding box on images.
  --output_pdf_path OUTPUT_PDF_PATH
                        The path for the output PDF file with bounding box.
  --captioning          Generate caption for images.

Default output folder: pdf_image

Default output pdf path: image_bbox.pdf

Identify Non-text Content

Part of Text Alternatives for Non-text Content

Identify images without alternative text using PyMuPDF and Pillow.

Usage Instruction

python script/extract_PDF_image.py --draw_bbox input_pdf_path

Default output path: image_bbox.pdf

Image of Text Classifier

Part of Text Alternatives for Non-text Content

Classify whether an image (such as JPG or PNG) primarily contains text before performing OCR or image captioning.

Image of Text	Non-text image

Fine-tuned an image classification model

Data: online-sourced images
Model: ResNet
Tools: fastai
Training Script: Google Colab

Usage Instruction

python script/image_of_text.py [--show_score] input_pdf_path

Optional arguments:
  --show_score    Show the classification score

Image Captioning

Part of Text Alternatives for Non-text Content

Generate a descriptive caption for a non-text image using a Transformer model.

Non-text image	Caption
	Indication of correct signature

Fine-tuned a Transformer model

Data:
- Non-text images extracted from PDF
- Captioned Manually
- Dataset card
Model:
- Pre-trained: GenerativeImage2Text
- Fine-tuned: model card and inference API
Tools: Hugging Face Transformers, PyTorch
Training Script: Google Colab

Usage Instruction

python script/generate_caption.py <input_image_path>

OCR

Part of Text Alternatives for Non-text Content

Extract text from images of text using Tesseract OCR.

Usage Instruction

python script/image_of_text.py input_pdf_path

Text Contrast

Identify low-contrast text inside a PDF using image segmentation and contrast ratio analysis.

Criteria: The visual presentation of text and images of text has a contrast ratio of at least 4.5:1.

WCAG guideline: 1.4.3

Overview

flowchart LR
    A[Extract Text Bounding Box]
    A --> C[Image Segmentation] -->|Optional| B[Save model prediction]
    C --> D[Calculate contrast]
    D -->|Optional| E[Output PDF with Bounding Box]
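
Between the segmentation and contrast steps, the predicted mask has to be reduced to one text color and one background color. One simple way to do that, shown below as a sketch, is to average the pixels inside and outside the mask. Whether the tool itself averages, takes a median, or does something else is not stated here, and `mean_colors` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


def mean_colors(
    pixels: Sequence[Sequence[RGB]],
    mask: Sequence[Sequence[int]],
) -> Tuple[RGB, RGB]:
    """Average the text pixels (mask == 1) and background pixels (mask == 0)."""
    text: List[RGB] = []
    background: List[RGB] = []
    for pixel_row, mask_row in zip(pixels, mask):
        for rgb, is_text in zip(pixel_row, mask_row):
            (text if is_text else background).append(rgb)
    if not text or not background:
        raise ValueError("mask must contain both text and background pixels")

    def average(group: List[RGB]) -> RGB:
        n = len(group)
        return (
            round(sum(p[0] for p in group) / n),
            round(sum(p[1] for p in group) / n),
            round(sum(p[2] for p in group) / n),
        )

    return average(text), average(background)


# A 2x2 text block: left column is text, right column is background.
block = [[(0, 0, 0), (255, 255, 255)], [(10, 10, 10), (245, 245, 245)]]
block_mask = [[1, 0], [1, 0]]
print(mean_colors(block, block_mask))
```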


Image of Text	Predicted Text Segmentation	Enough Contrast	Low Contrast

Fine-tuned a Transformer model

Data:
- Generate synthetic image of text with segmentation mask using Pillow
- Dataset Card
Model:
- Pre-trained: SegFormer
- Fine-tuned: model card
Tools: Hugging Face Transformers, PyTorch
Training Script: Google Colab

Calculate the contrast ratio

Luminance (brightness) of the colors:

$$L = 0.2126\times R+0.7152\times G+0.0722\times B$$

$$\small\text{where R, G, and B are normalized to 0-1}$$

Contrast Ratio $$\frac{L_1+0.05}{L_2+0.05}$$

$$\small\text{where } L_1 \text{ is the luminance of the lighter color and } L_2 \text{ is the luminance of the darker color, either text or background}$$
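
As a worked sketch of the two formulas: the functions below use the simplified luminance given here (channels only normalized to 0-1) and divide the lighter luminance by the darker one, which is how WCAG defines the ratio. WCAG's own definition of relative luminance additionally linearizes each sRGB channel before weighting. That step is included behind a `linearize` flag, which is an addition of this sketch rather than something the tool is known to do. All function names are hypothetical.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(c: float) -> float:
    """Convert an sRGB channel in 0-1 to linear light (WCAG relative luminance)."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def luminance(rgb: RGB, linearize: bool = False) -> float:
    """L = 0.2126 R + 0.7152 G + 0.0722 B with R, G, B scaled to 0-1."""
    r, g, b = (channel / 255 for channel in rgb)
    if linearize:
        r, g, b = _linearize(r), _linearize(g), _linearize(b)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a: RGB, color_b: RGB, linearize: bool = False) -> float:
    """(lighter + 0.05) / (darker + 0.05); the argument order does not matter."""
    la, lb = luminance(color_a, linearize), luminance(color_b, linearize)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def has_enough_contrast(
    text: RGB, background: RGB, threshold: float = 4.5, linearize: bool = False
) -> bool:
    """Apply the 4.5:1 criterion to a text/background color pair."""
    return contrast_ratio(text, background, linearize) >= threshold


# Black on white gives the maximum ratio, (1 + 0.05) / (0 + 0.05) = 21.
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))
```

The two modes can disagree substantially for mid-tone colors, so it is worth deciding which definition a report is based on before comparing results against other checkers.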

Usage Instruction

Generate synthetic image of text

python script/synthetic_text_seg.py [--sample_no SAMPLE_NO] [--output_folder OUTPUT_FOLDER] [--font_folder FONT_FOLDER]

Default output folder: image_of_text

Default font input folder: font

Identify low contrast text

python script/contrast_pdf.py [--output_bbox_img] [--output_dir OUTPUT_DIR] [--draw_bbox] [--output_pdf_path OUTPUT_PDF_PATH] [--bbox_extractor {PyMyPDF,pdfminer}] input_pdf_path

optional arguments:
  --output_bbox_img     Option to save text block images with low contrast.
  --output_dir OUTPUT_DIR
                        The directory for output images.
  --draw_bbox           Option to draw bounding boxes on low contrast text blocks.
  --output_pdf_path OUTPUT_PDF_PATH
                        The path for the output PDF file with drawn bounding boxes.
  --bbox_extractor {PyMyPDF,pdfminer}
                        Choice of bounding box extractor.

Default output folder: segmentation_model_output

Default output path with bounding box: bbox_low_contrast.pdf

Extract text bounding boxes (PDFMiner)

python script/extract_text_bbox_PDFminer.py [--output_pdf_path OUTPUT_PDF_PATH] input_pdf_path

Extract text bounding boxes (PyMuPDF)

python script/extract_text_bbox_PyMuPDF.py [--output_pdf_path OUTPUT_PDF_PATH] [--text_img] [--output_dir OUTPUT_DIR] input_pdf_path

Line Spacing

Analyze the line spacing using PDFMiner.

Criteria: Line height (line spacing) is at least 1.5 times the font size.

WCAG guideline: 1.4.12
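
The criterion reduces to a ratio check once baseline positions and the font size are known. The sketch below assumes the baselines (in points) and the font size have already been extracted, for example with PDFMiner. `line_spacing_ratios` and `meets_line_spacing` are hypothetical names, and how the tool itself measures line height is not specified here.

```python
from typing import List, Sequence


def line_spacing_ratios(baselines: Sequence[float], font_size: float) -> List[float]:
    """Distance between consecutive baselines divided by the font size."""
    if font_size <= 0:
        raise ValueError("font_size must be positive")
    ordered = sorted(baselines)
    return [(b - a) / font_size for a, b in zip(ordered, ordered[1:])]


def meets_line_spacing(
    baselines: Sequence[float], font_size: float, minimum: float = 1.5
) -> bool:
    """True when all adjacent lines are spaced at least `minimum` x font size."""
    ratios = line_spacing_ratios(baselines, font_size)
    # Small tolerance so floating-point noise does not fail an exact 1.5 ratio.
    return all(r >= minimum - 1e-9 for r in ratios)


print(meets_line_spacing([100.0, 118.0, 136.0], font_size=12.0))  # 18 / 12 = 1.5
print(meets_line_spacing([100.0, 114.0, 128.0], font_size=12.0))  # 14 / 12 < 1.5
```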

Usage Instruction

python script/line_spacing.py input_pdf_path

WCAG Accessibility Issues Covered

Images: non-text image (e.g. icon, header), image of text
Text Presentation: line spacing, text-background contrast
Language: language of page, language of parts

PDF Language Detection

Examines the PDF's metadata for a specified language property using Langdetect and PyMuPDF.

Criteria: Assistive technology can determine the language of a page

WCAG guideline: 3.1.1, 3.1.2

Basic usage:
```
python script/language_detection.py input_pdf_path
```
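
Once a declared language has been read from the PDF and a language has been detected from its text, comparing the two is mostly a normalization problem: declared values are typically BCP 47 tags such as `en-US`, while detectors usually return a shorter code such as `en`. A sketch of that comparison, with hypothetical function names and no claim about how `language_detection.py` itself does it:

```python
from typing import Optional


def primary_subtag(tag: Optional[str]) -> Optional[str]:
    """'en-US' -> 'en', 'ZH_cn' -> 'zh'; None or blank input -> None."""
    if not tag or not tag.strip():
        return None
    return tag.strip().replace("_", "-").split("-")[0].lower()


def language_issue(declared: Optional[str], detected: Optional[str]) -> Optional[str]:
    """Return a short description of the problem, or None if nothing is flagged."""
    declared_lang = primary_subtag(declared)
    detected_lang = primary_subtag(detected)
    if declared_lang is None:
        return "no language declared in the PDF"
    if detected_lang is not None and declared_lang != detected_lang:
        return f"declared '{declared_lang}' but text looks like '{detected_lang}'"
    return None


print(language_issue("en-US", "en"))
print(language_issue(None, "fr"))
```

Comparing only the primary subtag avoids flagging regional variants as mismatches, at the cost of not distinguishing them.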
