acislab/HuMaIN_Collaborative_Data_Extraction

Cooperative Human-Machine Data Extraction from Biological Collections

Scripts developed for the experiments of the study:

  • damerauCmpDir.py : Compares the files in two folders, returning the normalized Damerau-Levenshtein distance for each file common to both. The Damerau-Levenshtein implementation used is the one developed by Geoffrey Fairchild, available at https://github.com/gfairchild/pyxDamerauLevenshtein.
  • jaroCmpDir.py : Compares the files in two folders, returning the normalized Jaro-Winkler distance for each file common to both. The Jaro-Winkler implementation used is the one available at https://pypi.python.org/pypi/jellyfish.
  • eqCmpDir.py : It compares the files in two folders, returning the percentage of words in file 1 which are also present in file 2.
  • img2txt.py : Runs the OCRopy OCR pipeline (binarization, segmentation, and recognition). Set the dirOcropy variable to the OCRopus installation path before running.
  • ocrFolder.py : Runs the img2txt (OCR) script on each jpg file in the input folder.
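
The folder-comparison scripts above share one pattern: find the files present in both folders, read each pair, and apply a similarity metric. The following is a minimal sketch of that pattern, not the code from this repository. All function names (`osa_distance`, `normalized_distance`, `word_overlap_percent`, `compare_dirs`) are hypothetical. To stay dependency-free it uses a pure-Python optimal string alignment distance (a restricted form of Damerau-Levenshtein) in place of pyxDamerauLevenshtein, so results may differ from the library's. It also assumes that "normalized" means dividing by the length of the longer text, and that the word percentage counts every word occurrence in file 1.

```python
import os


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent characters. Stand-in for the library used by damerauCmpDir.py.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def normalized_distance(a, b):
    """Distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return osa_distance(a, b) / longest if longest else 0.0


def word_overlap_percent(text1, text2):
    """Percentage of word occurrences in text1 that also appear in text2."""
    words1 = text1.split()
    if not words1:
        return 0.0
    words2 = set(text2.split())
    return 100.0 * sum(w in words2 for w in words1) / len(words1)


def compare_dirs(dir1, dir2, metric):
    """Apply metric(text1, text2) to every file name present in both folders."""
    common = sorted(set(os.listdir(dir1)) & set(os.listdir(dir2)))
    results = {}
    for name in common:
        p1, p2 = os.path.join(dir1, name), os.path.join(dir2, name)
        if not (os.path.isfile(p1) and os.path.isfile(p2)):
            continue  # skip subfolders and other non-file entries
        with open(p1, encoding="utf-8") as f1, open(p2, encoding="utf-8") as f2:
            results[name] = metric(f1.read(), f2.read())
    return results
```

Passing the metric as an argument keeps the folder traversal in one place: `compare_dirs(dir1, dir2, normalized_distance)` mirrors the role of damerauCmpDir.py, and `compare_dirs(dir1, dir2, word_overlap_percent)` mirrors eqCmpDir.py. A Jaro-Winkler function from jellyfish could be supplied the same way.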

Paper: Icaro Alzuru, Andréa Matsunaga, Maurício Tsugawa, and José A.B. Fortes, "Cooperative Human-Machine Data Extraction from Biological Collections," 2016 IEEE 12th International Conference on e-Science (e-Science), Baltimore, MD, 2016, pp. 41-50. doi.org/10.1109/eScience.2016.7870884

License: Apache 2.0 (read License)

Acknowledgement

HuMaIN is funded by a grant from the National Science Foundation's ACI Division of Advanced Cyberinfrastructure (Award Number: 1535086). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
