
1. Bounding Boxes

Saumya Shah edited this page Aug 14, 2018 · 1 revision

Overview

The bounding box algorithm takes an image containing many entries, identifies each entry, and produces a set of cropped images that each contain exactly one entry.

Bounding Boxes

The bounding box step combines image processing techniques with a rough skeleton of the image being cropped. This skeleton can be obtained from hOCR. hOCR is an open standard for representing formatted text obtained from optical character recognition (OCR). It encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML).

From the hOCR output we can determine the coordinates where every entry begins and ends. With those coordinates we can isolate each entry and crop it. Since all images share the same layout, the hOCR output for all images will be more or less the same, so we can run a batch process over all the images.
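The coordinate lookup described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation. hOCR stores each element's box in its title attribute as "bbox x0 y0 x1 y1"; the choice of the ocr_par class as the unit that corresponds to one entry is an assumption, and the names HocrBoxParser and extract_boxes are hypothetical.

```python
import re
from html.parser import HTMLParser

# hOCR puts the box in the title attribute, e.g. title="bbox 10 20 300 80; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxParser(HTMLParser):
    """Collect (x0, y0, x1, y1) boxes for elements of one hOCR class."""

    def __init__(self, target_class="ocr_par"):
        super().__init__()
        # Assumption: one paragraph element corresponds to one entry.
        self.target_class = target_class
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.target_class not in classes:
            return
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def extract_boxes(hocr_text, target_class="ocr_par"):
    """Return the bounding boxes of all elements with the given class."""
    parser = HocrBoxParser(target_class)
    parser.feed(hocr_text)
    return parser.boxes
```

Each returned tuple is in (left, upper, right, lower) order, which is the order an image library such as Pillow expects for its crop call, so the boxes could be passed on to a cropping step in a batch loop.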

Issues

  1. The hOCR output depends on the individual image, so the structure of the recognized text may change from image to image. We can only make a rough estimate of where entries lie.
  2. External noise, in this case stray characters from the previous or next page, can interfere with the hOCR output. Cropping out the sides of the image is necessary.
  3. Since the hOCR output is very volatile, we may not always get exactly one entry per cropped image. Further rounds on the same image, or manual cropping, may be needed.
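One way to act on issues 2 and 3 is sketched below, under stated assumptions: boxes are (x0, y0, x1, y1) tuples, stray characters from neighbouring pages fall entirely inside a narrow strip at the left or right edge, and a box whose height differs strongly from the typical height probably holds more or less than one entry. The function names, the 5% margin and the 50% tolerance are hypothetical values chosen for illustration, not settings taken from the project.

```python
from statistics import median


def drop_margin_boxes(boxes, page_width, margin_frac=0.05):
    """Discard boxes that lie entirely inside the left or right margin strip."""
    left = page_width * margin_frac
    right = page_width * (1 - margin_frac)
    # Keep a box if any part of it reaches into the central area.
    return [b for b in boxes if b[2] > left and b[0] < right]


def flag_for_review(boxes, tolerance=0.5):
    """Return boxes whose height deviates from the median by more than tolerance."""
    if not boxes:
        return []
    heights = [y1 - y0 for _, y0, _, y1 in boxes]
    typical = median(heights)
    return [
        box
        for box, height in zip(boxes, heights)
        if abs(height - typical) > tolerance * typical
    ]
```

Boxes returned by flag_for_review are candidates for another round on the same image or for manual cropping; the thresholds would need tuning against real pages.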

Implementation

To take a look at the implementation code, usage and output sample, click here.
