Document accessibility ensures equal access to information for everyone, regardless of ability. It fosters a more inclusive society and is an ethical responsibility toward diverse audiences. Investing in document accessibility can also benefit businesses by expanding their audience and demonstrating a commitment to ethical practices and social responsibility.
This Python CLI tool detects and addresses WCAG accessibility issues in offline documents, currently targeting PDF. It aims to promote a user-friendly experience for everyone and to help users access documents with assistive technology.
git clone https://github.com/caraaaaa/doc_accessibility.git
conda env create -f env/environment.yml
conda activate doc_accessibility_cli
- Searchable PDF Creator
- Text Alternatives for Non-text Content
- Identify Non-text Content
- Generate Description for Non-text Content
- Text Representation
Convert a scanned PDF document to a searchable format using TesseractOCR.
Criteria: Users can read or extract the words using assistive technologies, or manipulate the PDF for accessibility.
WCAG guideline: 1.4.5
[Figure: scanned PDF (left) and the resulting searchable PDF (right)]
Usage Instruction
python script/scanned2searchable.py [-o OUTPUT_PDF_PATH] [-s] input_pdf_path
Default output path: readable_pdf.pdf
optional arguments:
-o OUTPUT_PDF_PATH, --output_pdf_path OUTPUT_PDF_PATH
The path for the output searchable PDF file.
-s, --show_result Show text of the searchable PDF after OCR
Provide descriptions for images inside a searchable PDF using image classification, a transformer model, and OCR.
Criteria: All non-text content (e.g. images, formulas) that is presented to the user has a text alternative that serves the equivalent purpose.
WCAG guideline: 1.1
Overview
flowchart LR
A[Extract Images] -->|Optional| B[Save Images]
A --> C[Check Alt Text]
C --> D[Classification]
D -->|image of text| F[Perform OCR]
D -->|non-text image| G[Generate Image Caption]
F --> H[Provide Alt Text Suggestions]
G --> H
H -->|Optional| I[Output PDF with Bounding Box]
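The routing in the flowchart above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `ExtractedImage`, `suggest_alt_text`, and the three callables passed in are hypothetical names standing in for the classifier, OCR, and captioning steps. It also assumes that images which already carry alt text are skipped, which the "Check Alt Text" step suggests but does not state.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractedImage:
    """Hypothetical container for one image pulled from a PDF page."""
    page: int
    data: bytes
    alt_text: Optional[str] = None


def suggest_alt_text(
    image: ExtractedImage,
    is_image_of_text: Callable[[bytes], bool],
    run_ocr: Callable[[bytes], str],
    caption: Callable[[bytes], str],
) -> Optional[str]:
    """Return an alt-text suggestion, or None if the image already has one."""
    if image.alt_text:
        # Assumption: existing alt text is left untouched.
        return None
    if is_image_of_text(image.data):
        # Image of text: the recognised text is the suggestion.
        return run_ocr(image.data).strip()
    # Non-text image: fall back to a generated caption.
    return caption(image.data).strip()
```

Passing the three steps in as callables keeps the routing testable with simple stubs, independent of which classifier, OCR engine, or captioning model is plugged in.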
Usage Instruction
python script/extract_PDF_image.py [--output_img] [--output_folder OUTPUT_FOLDER] [--draw_bbox] [--output_pdf_path OUTPUT_PDF_PATH] [--captioning] input_pdf_path
optional arguments:
--output_img Output images extracted from the PDF.
--output_folder OUTPUT_FOLDER
The directory for the output images.
--draw_bbox Output PDF with bounding box on images.
--output_pdf_path OUTPUT_PDF_PATH
The path for the output PDF file with bounding box.
--captioning Generate caption for images.
Default output folder: pdf_image
Default output pdf path: image_bbox.pdf
Part of Text Alternatives for Non-text Content
Identify images without alternative text using PyMuPDF and Pillow.
Usage Instruction
python script/extract_PDF_image.py --draw_bbox input_pdf_path
Default output path: image_bbox.pdf
Part of Text Alternatives for Non-text Content
Classify whether an image (such as a JPG or PNG) primarily contains text before performing OCR or image captioning.
[Figure: an image of text (left) and a non-text image (right)]
- Data: online-sourced images
- Model: ResNet
- Tools: fastai
- Training Script: Google Colab
Usage Instruction
python script/image_of_text.py [--show_score] input_pdf_path
Optional arguments:
--show_score Show the classification score
Part of Text Alternatives for Non-text Content
Generate a descriptive caption for a non-text image using a Transformer model.
[Figure: a non-text image with the generated caption "Indication of correct signature"]
- Data:
  - Non-text images extracted from PDF
  - Captioned manually
  - Dataset card
- Model:
  - Pre-trained: GenerativeImage2Text
  - Fine-tuned: model card and inference API
- Tools: HuggingFace Transformers, PyTorch
- Training Script: Google Colab
Usage Instruction
python script/generate_caption.py <input_image_path>
Part of Text Alternatives for Non-text Content
Extract text from an image of text using TesseractOCR.
Usage Instruction
python script/image_of_text.py input_pdf_path
Identify low-contrast text inside a PDF using image segmentation and contrast ratio analysis.
Criteria: The visual presentation of text and images of text has a contrast ratio of at least 4.5:1.
WCAG guideline: 1.4.3
flowchart LR
A[Extract Text Bounding Box]
A --> C[Image Segmentation] -->|Optional| B[Save model prediction]
C --> D[Calculate contrast]
D -->|Optional| E[Output PDF with Bounding Box]
[Figure: image of text, predicted text segmentation, and examples of enough contrast and low contrast]
- Data:
  - Synthetic images of text with segmentation masks, generated using Pillow
  - Dataset card
- Model:
  - Pre-trained: SegFormer
  - Fine-tuned: model card
- Tools: HuggingFace Transformers, PyTorch
- Training Script: Google Colab
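One way the segmentation output could feed the contrast check is to average the pixel colors inside and outside the predicted text mask. The sketch below assumes a binary mask (1 for text, 0 for background) with the same shape as the image. `mean_colors` is a hypothetical helper, and the repository may aggregate colors differently.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def mean_colors(pixels: List[List[RGB]], mask: List[List[int]]) -> Tuple[RGB, RGB]:
    """Return (text_color, background_color) averaged over a binary mask."""
    sums = {0: [0, 0, 0], 1: [0, 0, 0]}
    counts = {0: 0, 1: 0}
    for pixel_row, mask_row in zip(pixels, mask):
        for rgb, label in zip(pixel_row, mask_row):
            key = 1 if label else 0
            counts[key] += 1
            for channel in range(3):
                sums[key][channel] += rgb[channel]
    if counts[0] == 0 or counts[1] == 0:
        raise ValueError("mask must contain both text and background pixels")

    def average(key: int) -> RGB:
        # Round each channel mean back to an integer color value.
        return tuple(round(total / counts[key]) for total in sums[key])

    return average(1), average(0)
```

The two averaged colors can then be compared using the WCAG contrast ratio.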
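Calculate the contrast ratio
- Relative luminance (brightness) of each color, computed from its linearized sRGB channels $R$, $G$, $B$:
$$L = 0.2126R + 0.7152G + 0.0722B$$
- Contrast ratio, where $L_1$ is the luminance of the lighter color and $L_2$ that of the darker color:
$$\frac{L_1+0.05}{L_2+0.05}$$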
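The contrast ratio can be computed in a few lines of Python. The sketch below follows the WCAG 2.x definition of relative luminance for sRGB colors and always puts the lighter color in the numerator. The function names are illustrative and not taken from the repository.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def relative_luminance(rgb: RGB) -> float:
    """WCAG 2.x relative luminance of an 8-bit sRGB color."""
    def linearize(value: int) -> float:
        c = value / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a: RGB, color_b: RGB) -> float:
    """Contrast ratio between two colors, independent of argument order."""
    lum_a = relative_luminance(color_a)
    lum_b = relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def meets_minimum_contrast(color_a: RGB, color_b: RGB, minimum: float = 4.5) -> bool:
    """True if the pair reaches the 4.5:1 threshold of guideline 1.4.3."""
    return contrast_ratio(color_a, color_b) >= minimum
```

For example, `contrast_ratio((255, 255, 255), (0, 0, 0))` is approximately 21, the largest ratio the formula can produce.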
Usage Instruction
- Generate synthetic image of text
python script/synthetic_text_seg.py [--sample_no SAMPLE_NO] [--output_folder OUTPUT_FOLDER] [--font_folder FONT_FOLDER]
Default output folder: image_of_text
Default font input folder: font
- Identify low contrast text
python script/contrast_pdf.py [--output_bbox_img] [--output_dir OUTPUT_DIR] [--draw_bbox] [--output_pdf_path OUTPUT_PDF_PATH] [--bbox_extractor {PyMyPDF,pdfminer}] input_pdf_path
optional arguments:
--output_bbox_img Option to save text block images with low contrast.
--output_dir OUTPUT_DIR
The directory for output images.
--draw_bbox Option to draw bounding boxes on low contrast text blocks.
--output_pdf_path OUTPUT_PDF_PATH
The path for the output PDF file with drawn bounding boxes.
--bbox_extractor {PyMyPDF,pdfminer}
Choice of bounding box extractor.
Default output folder: segmentation_model_output
Default output path with bounding box: bbox_low_contrast.pdf
- Extract text bounding boxes (PDFMiner)
python script/extract_text_bbox_PDFminer.py [--output_pdf_path OUTPUT_PDF_PATH] input_pdf_path
- Extract text bounding boxes (PyMuPDF)
python script/extract_text_bbox_PyMuPDF.py [--output_pdf_path OUTPUT_PDF_PATH] [--text_img] [--output_dir OUTPUT_DIR] input_pdf_path
Analyze line spacing using PDFMiner.
Criteria: Line height (line spacing) is at least 1.5 times the font size.
WCAG guideline: 1.4.12
Usage Instruction
python script/line_spacing.py input_pdf_path
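The 1.5x criterion can be illustrated with a small check on baseline positions. This is a sketch under stated assumptions, not the script's implementation: `TextLine` is a hypothetical record of a line's baseline and font size (in points, with y growing upward as in PDF coordinates), and all lines are assumed to belong to one paragraph in a single column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    baseline_y: float  # points, PDF coordinates (y grows upward)
    font_size: float   # points


def line_spacing_ratios(lines: List[TextLine]) -> List[float]:
    """Baseline-to-baseline distance divided by font size, top to bottom."""
    ordered = sorted(lines, key=lambda line: line.baseline_y, reverse=True)
    ratios = []
    for upper, lower in zip(ordered, ordered[1:]):
        distance = upper.baseline_y - lower.baseline_y
        ratios.append(distance / lower.font_size)
    return ratios


def has_sufficient_spacing(lines: List[TextLine], minimum: float = 1.5) -> bool:
    """True if every consecutive pair of lines meets the minimum ratio."""
    return all(ratio >= minimum for ratio in line_spacing_ratios(lines))
```

In a real document, paragraph breaks and multi-column layouts would need to be separated first, since they produce large gaps that are not line spacing.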
- Images: non-text image (e.g. icon, header), image of text
- Text Presentation: line spacing, text-background contrast
- Language: language of page, language of parts
Examine the PDF's metadata for a specified language property using Langdetect and PyMuPDF.
Criteria: Assistive technology can determine the language of a page.
Basic usage:
python script/language_detection.py input_pdf_path
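A language check of this kind might compare the language declared in the PDF with the language detected from its text. The sketch below assumes that comparison and is not the script's actual logic: `check_document_language` and its return values are hypothetical, and the declared and detected codes are taken as plain strings such as `en-US` or `en`.

```python
from typing import Optional


def primary_subtag(tag: str) -> str:
    """Reduce a language tag such as 'en-US' or 'zh_cn' to 'en' or 'zh'."""
    return tag.replace("_", "-").split("-")[0].strip().lower()


def check_document_language(declared: Optional[str], detected: str) -> str:
    """Classify the declared language as 'missing', 'ok', or 'mismatch'."""
    if not declared or not declared.strip():
        # No language property: assistive technology cannot determine it.
        return "missing"
    if primary_subtag(declared) == primary_subtag(detected):
        return "ok"
    return "mismatch"
```

Comparing only the primary subtag treats regional variants such as `en-US` and `en-GB` as the same language, which is a deliberate simplification.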