E-book accessibility metadata extraction in SANE environment

KBNLresearch/SANE-ae

Contents of this repo

This repo contains a script that is meant to be used within the SANE environment for extracting accessibility info from e-books. It wraps around the rwp tool that is part of the Readium Go Toolkit. A Linux binary of rwp is included in this repo.

Currently only EPUB files (identified by the ".epub" file extension) are supported! Although the rwp tool also supports PDF, preliminary tests resulted in various problems, so PDF files are ignored in this version.

script.py

Get accessibility info from all e-books in a directory tree using the Readium Go Toolkit's rwp tool. The output is wrapped in a TAR file, with file info stored in a separate JSON file because rwp doesn't report this directly.
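
As an illustration of the wrapping step, here is a minimal sketch of how a command-line tool such as rwp could be called from Python while recording its command line and exit status. This is not the code from script.py: the function name `run_tool` and the dictionary keys are hypothetical, and the rwp invocation in the usage note is an assumption.

```python
import shlex
import subprocess


def run_tool(cmd):
    """Run a command (list of arguments) and capture what it reports.

    Returns the command line as a string, the exit status and stdout.
    The key names are hypothetical, chosen for this sketch only.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "cmdStr": shlex.join(cmd),    # full command line, shell-quoted
        "status": result.returncode,  # exit status of the tool
        "stdout": result.stdout,      # tool output (JSON in the case of rwp)
    }
```

A call could then look like `run_tool(["./rwp", "manifest", "book.epub"])`, assuming the included binary sits in the working directory and that `manifest` is the subcommand used; check the rwp documentation for the actual arguments.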

Input directory structure

The script assumes a flat directory structure: the input directory contains a single level of child directories, each of which contains one e-book. Here's an example:

./dirIn
├── IP1523369858048
│   └── 20161227153031_9789400403178.epub
├── IP1564040668928
│   └── 20190627233027_9789029582179.epub
└── IP1700049667584
    └── 20221207163015_9789044934458.epub
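
The layout above can be walked with a few lines of Python. The sketch below is an assumption about how such a scan might look, not the actual implementation: `find_epubs` is a hypothetical helper, and matching the ".epub" extension case-insensitively is a choice made here for illustration.

```python
from pathlib import Path


def find_epubs(dir_in):
    """Return (parent directory name, EPUB path) pairs found one level below dir_in."""
    found = []
    for child in sorted(Path(dir_in).iterdir()):
        if not child.is_dir():
            continue  # files directly inside dir_in are not part of the layout
        for entry in sorted(child.iterdir()):
            # Only files with the ".epub" extension are picked up; PDFs and
            # anything else are ignored. Case-insensitive matching is an assumption.
            if entry.is_file() and entry.suffix.lower() == ".epub":
                found.append((child.name, entry))
    return found
```

The parent directory name is returned alongside each path because the output directories are named after it.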

Command-line syntax

python3 script.py [-h] -i DIRIN -o DIROUT -t DIRTEMP
                 [-p PREFIXOUT]

With:

  • DIRIN: input directory
  • DIROUT: output directory
  • DIRTEMP: temporary file directory
  • PREFIXOUT: optional output prefix (default: "sane-ae")

Example:

python3 script.py -i ./SANE-AE-Sampleset -o ./testOut -t ./testTemp
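
For readers who want to reproduce this interface, the syntax above maps onto `argparse` roughly as follows. This is a sketch under assumptions: the `dest` names (`dirIn`, `dirOut`, `dirTemp`, `prefixOut`) are hypothetical and may differ from those used in script.py.

```python
import argparse


def build_parser():
    """Build a parser matching the documented options (dest names are assumed)."""
    parser = argparse.ArgumentParser(prog="script.py")
    parser.add_argument("-i", dest="dirIn", metavar="DIRIN", required=True,
                        help="input directory")
    parser.add_argument("-o", dest="dirOut", metavar="DIROUT", required=True,
                        help="output directory")
    parser.add_argument("-t", dest="dirTemp", metavar="DIRTEMP", required=True,
                        help="temporary file directory")
    parser.add_argument("-p", dest="prefixOut", metavar="PREFIXOUT",
                        default="sane-ae", help="optional output prefix")
    return parser
```

With this parser, omitting `-p` leaves the prefix at its documented default of "sane-ae".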

Output

  • sane-ae.tar: TAR archive with output
  • sane-ae.log: log file

The output TAR contains one directory for each processed file. The name of each directory corresponds to the name of the direct parent directory of the input file. Each directory contains the following files:

  • fileinfo.json: JSON file with, for each e-book file, its name, full file path, the file format, the rwp version string, the full rwp command line and the rwp exit status
  • rwp.json: JSON file with output of the rwp tool
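
The archive layout described above could be produced along these lines. This is a sketch, not the script's own code: `add_json` and `write_output` are hypothetical helpers, the contents of the dictionaries are left to the caller, and the JSON is written from memory here, whereas the real script takes a temporary directory (DIRTEMP).

```python
import io
import json
import tarfile


def add_json(tar, arcname, data):
    """Serialise data as JSON and add it to an open TAR archive under arcname."""
    payload = json.dumps(data, indent=2).encode("utf-8")
    info = tarfile.TarInfo(name=arcname)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_output(tar_path, records):
    """Write one directory per processed file, named after its parent directory.

    records: iterable of (parent_dir_name, file_info dict, rwp_output dict).
    """
    with tarfile.open(tar_path, "w") as tar:
        for parent, file_info, rwp_output in records:
            add_json(tar, f"{parent}/fileinfo.json", file_info)
            add_json(tar, f"{parent}/rwp.json", rwp_output)
```

Reading the results back works the same way in reverse: open the archive with `tarfile.open`, then pass the result of `extractfile("<parent>/rwp.json")` to `json.load`.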
