Skip to content

A Python CLI tool designed to detect and address WCAG accessibility issues identified in offline documents. It aims to help disabilities have equal access to document.

Notifications You must be signed in to change notification settings

caraaaaa/doc_accessibility

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Offline Document Asseessbility (PDF)

Document accessibility ensures equal access to information for everyone, regardless of their abilities. It not only helps foster a more inclusive society but is also an ethical responsibility to respect diverse audience needs. Besides, investing in document accessibility can benefit businesses by expanding their audience and demonstrating a commitment to ethical practices and social responsibility.

Here is a Python CLI tool designed to detect and address WCAG accessibility issues identified in offline documents, targeting PDF. It aims to promote a user-friendly experience for everyone and enables users to access documents with assistive technology.

Installation

git clone https://github.com/caraaaaa/doc_accessibility.git
conda env create -f env/environment.yml
conda activate doc_accessibility_cli

Features

Searchable PDF Creator

Convert scanned PDF document to searchable format using TesseractOCR.

Criteria: User can read or extract the words using assistive technologies, or manipulate the PDF for accessibility.

WACG guideline: 1.4.5

Scanned PDF Searchable PDF
Usage Instruction
python script/scanned2searchable.py [-o OUTPUT_PDF_PATH] [-s] input_pdf_path

Default output path: readable_pdf.pdf

optional arguments:
  -o OUTPUT_PDF_PATH, --output_pdf_path OUTPUT_PDF_PATH
                        The path for the output searchable PDF file.
  -s, --show_result     Show text of the searchable PDF after OCR

Text Alternatives for Non-text Content

Provide descriptions for images inside searchable PDF using image classification, transformer model and OCR.

Criteria: All non-text content (e.g. images, formulas) that is presented to the user has a text alternative that serves the equivalent purpose

WACG guideline: 1.1

Overview

flowchart LR
    A[Extract Images] -->|Optional| B[Save Images]
    A --> C[Check Alt Text]
    C --> D[Classification]
    D -->|image of text| F[Perform OCR]
    D -->|non-text image| G[Generate Image Caption]
    F --> H[Provide Alt Text Suggestions]
    G --> H
    H -->|Optional| I[Output PDF with Bounding Box]
Loading
Usage Instruction
python script/extract_PDF_image.py [--output_img] [--output_folder OUTPUT_FOLDER] [--draw_bbox] [--output_pdf_path OUTPUT_PDF_PATH] [--captioning] input_pdf_path
optional arguments:
  --output_img          Output images extracted from the PDF.
  --output_folder OUTPUT_FOLDER
                        The directory for the output images.
  --draw_bbox           Output PDF with bounding box on images.
  --output_pdf_path OUTPUT_PDF_PATH
                        The path for the output PDF file with bounding box.
  --captioning          Generate caption for images.

Default output folder: pdf_image Default output pdf path: image_bbox.pdf

Identify Non-text Content

Part of Text Alternatives for Non-text Content

Identify images which without alternative text using PyMuPDF and Pillow.

Usage Instruction
python script/extract_PDF_image.py --draw_bbox input_pdf_path

Default output path: image_bbox.pdf

Image of Text Classifier

Part of Text Alternatives for Non-text Content

Classifiy if an image (such as JPG or PNG) primarily contains text before performing OCR or image captioning.

Image of Text Non-text image

Fine-tuned a image classification model

  • Data: online-sourced images
  • Model: ResNet
  • Tools: fastai
  • Training Script: Google Colab
Usage Instruction
python script/image_of_text.py [--show_score] input_pdf_path
Optional arguments:
  --show_score    Show the classification score

Image Captioning

Part of Text Alternatives for Non-text Content

Generate descriptive caption for non-text image using Transformer model.

Non-text image Caption
Indication of correct signature

Fine-tuned a Transformer model

Usage Instruction
python script/generate_caption.py <input_image_path>

OCR

Part of Text Alternatives for Non-text Content

Extract text from image-of-text using TesseractOCR

Usage Instruction
python script/image_of_text.py input_pdf_path

Text Constrast

Identify low contrast text inside PDF using image segmentation and contrast ratio analysis.

Criteria: The visual presentation of text and images of text has a contrast ratio of at least 4.5:1

WACG guideline: 1.4.3

Overview

flowchart LR
    A[Extract Text Bounding Box]
    A --> C[Image Segmentation] -->|Optional| B[Save model prediction]
    C --> D[Calculate contrast]
    D -->|Optional| E[Output PDF with Bounding Box]
Loading
Image of Text Predicted Text Segmentation Enought Constrast Low Contrast

Fine-tuned a Transformer model

Calculate the contrast ratio

  • Luminance (brightness) of the colors:

$$L = 0.2126\times R+0.7152\times G+0.0722\times B$$

$$\small\text{where R, G, and B are normalized to 0-1}$$

  • Contrast Ratio $$\frac{L2+0.05}{L1+0.05}$$

$$\small\text{where L1 is the luminance of the lighter color, either text or background}$$

Usage Instruction
  • Generate synthetic image of text
python script/synthetic_text_seg.py [--sample_no SAMPLE_NO] [--output_folder OUTPUT_FOLDER] [--font_folder FONT_FOLDER]

Default output folder: image_of_text

Default font input folder: font

  • Identify low contrast text
python script/contrast_pdf.py [--output_bbox_img] [--output_dir OUTPUT_DIR] [--draw_bbox] [--output_pdf_path OUTPUT_PDF_PATH] [--bbox_extractor {PyMyPDF,pdfminer}] input_pdf_path
optional arguments:
  --output_bbox_img     Option to save text block images with low contrast.
  --output_dir OUTPUT_DIR
                        The directory for output images.
  --draw_bbox           Option to draw bounding boxes on low contrast text blocks.
  --output_pdf_path OUTPUT_PDF_PATH
                        The path for the output PDF file with drawn bounding boxes.
  --bbox_extractor {PyMyPDF,pdfminer}
                        Choice of bounding box extractor.

Default output folder: segmentation_model_output

Default output path with bounding box: bbox_low_contrast.pdf

  • Extract text bounding boxes (PDFMiner)
python script/extract_text_bbox_PDFminer.py [--output_pdf_path OUTPUT_PDF_PATH] input_pdf_path
  • Extract text bounding boxes (PyMuPDF)
python script/extract_text_bbox_PyMuPDF.py [--output_pdf_path OUTPUT_PDF_PATH] [--text_img] [--output_dir OUTPUT_DIR] input_pdf_path

Line Spacing

Analyze the line spacing using PDFminer

Criteria: Line height (line spacing) to at least 1.5 times the font size

WACG guideline: 1.4.12

Usage Instruction
python script/line_spacing.py input_pdf_path

WCAG Accessibility Issues Covered

  • Images: non-text image (i.e. icon, header), image of text

  • Text Presentation: line spacing, text-background contrast

  • Language: language of page, language of parts

    PDF Language Detection

    Examines the PDF's metadata for a specified language property using Langdetect and PyMuPDF.

    Criteria: Assistive technology can determine the language of a page

    WACG guideline: 3.1.1, 3.1.2

    Basic usage:

    python script/language_detection.py input_pdf_path
    

About

A Python CLI tool designed to detect and address WCAG accessibility issues identified in offline documents. It aims to help disabilities have equal access to document.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages