- damerauCmpDir.py : It compares the files in two folders, returning the normalized Damerau-Levenshtein distance for each common file. The Damerau-Levenshtein used is the developed by Geoffrey Fairchild and available at https://github.com/gfairchild/pyxDamerauLevenshtein.
- jaroCmpDir.py : It compares the files in two folders, returning the normalized Jaro-Winkler distance for each common file. The Jaro-Winkler implementation is the available at https://pypi.python.org/pypi/jellyfish.
- eqCmpDir.py : It compares the files in two folders, returning the percentage of words in file 1 which are also present in file 2.
- img2txt.py : Script which executes the OCRopy OCR process (Binarization, Segmentation, and Recognition). Please configure dirOcropy variable to indicate the OCRopus path.
- ocrFolder.py : Script which executes the img2txt (OCR) script to each jpg file available at the input folder.
Paper: Icaro Alzuru, Andréa Matsunaga, Maurício Tsugawa, and José A.B. Fortes, Cooperative Human-Machine Data Extraction from Biological Collections, 2016 IEEE 12th International Conference on eScience, 2016 IEEE 12th International Conference on e-Science (e-Science), Baltimore, MD, 2016, pp. 41-50. doi.org/10.1109/eScience.2016.7870884
License: Apache 2.0 (read License)
HuMaIN is funded by a grant from the National Science Foundation's ACI Division of Advanced Cyberinfrastructure (Award Number: 1535086). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.