Skip to content

Latest commit

 

History

History
268 lines (199 loc) · 9.88 KB

README.rst

File metadata and controls

268 lines (199 loc) · 9.88 KB

RescueOMR: batch Optical Mark Recognition without foresight

RescueOMR is a set of tools to perform batch OMR (Optical Mark Recognition, ie: detect check-marks on paper questionnaires) on a loose set of scanned documents.

RescueOMR uses image-matching to identify and locate the checkmarks to be analyzed, without any document preparation. This allows for printing defects, page and even document variations to be accounted for without foresight.

The checkmark structure can be reconstructed directly from the final print using just a sample page, Gimp and Inkscape.

Read the usage tutorial for an hands-on demonstration.

RescueOMR consists of two main tools used together in sequence:

extractmpl:locate a template within a larger image and return a registered image that can be used for OMR
simpleomr:given a registered image and checkbox locations, returns the status (ticked/unticked) of each in tabular format

In the simplest usage scenario, the following illustrates how a new questionnaire is prepared and then processed:

  1. Scan a page of the questionnaire containing the checkboxes at 300dpi in grayscale. Open the image with an editor such as Gimp.
  2. With the aid of guides, correct the image (rotate and deskew as needed) so that checkboxes become square and straight.
  3. Crop the area around the checkboxes, leaving some surrounding detail.
  4. Cleanup any pen marks, so that checkboxes become empty.
  5. Save the resulting image as template.png
  1. Open the template (template.png) with Inkscape.
  2. Loosely draw a rectangle around each checkbox, trying to use the same size for all.
  3. Assign an unique ID to each drawn rectangle (right click -> Object properties ... -> ID).
  4. Save the resulting file as checkboxes.svg.

Scan all the pages in bulk at 300 dpi, grayscale. Process each page using the following pseudo-code:

for page in $pages
do
  # find the template in the page
  ./bin/extractmpl template.png $page registered.png
  if [ $? == 0 ]
  then
    # template found, extract checkbox values
    ./bin/simpleomr checkboxes.svg registered.png > $page.txt
  fi
done

The numerical values of the checkmarks in each page will be written into corresponding text files.

Calling extractmpl multiple times using different templates will allow to detect different questionnaires and do selective processing of the checkmarks just by checking its exit status.

Traditional OMR software require questionnaires to be prepared with dedicated tools, or to add additional formatting/barcodes/marks for page registration. Processing a document that was not initially prepared for OMR is often hard or results in a very high error recognition rate even when using high-end printing and scanning equipment.

RescueOMR bypasses the image registration requirement by using image-matching directly. No special tools are required: a cleaned sample of the region to match is all it's needed. By having no assumption on the underlying page structure, RescueOMR can easily find similar regions in different page layouts, often without multiple templates.

RescueOMR is generally not intended to substitute existing software when designing a new questionnaire from scratch. Software such as SDAPS or queXF offer integrated questionnaire design tools that can handle the entire processing workflow, from design to answer validation.

RescueOMR is also not a full OMR suite: it's mainly designed to aid in digitalization of pre-existing paper questionnaires, often accompanied by custom scripting and validation.

RescueOMR settings and tolerances are tuned for 300 dpi gray-scale documents only. When scanning your questionnaires, be sure to set the scanning parameters to 300 dpi gray-scale, possibly without compression.

RescueOMR requires the following software packages:

  • Python 3
  • Python Imaging Library (PIL) (python3-pil)
  • Python lxml (python3-lxml)
  • NumPy (python3-numpy)
  • SciPy (python3-scipy)
  • scikit-image (python3-skimage)

Under Debian/Ubuntu, install the required dependencies with:

sudo apt-get install python3-pil python3-lxml python3-numpy
sudo apt-get install python3-scipy python3-skimage

extractmpl works by locating edge features in the template that are also present inside the page. Details in the page which do not exist in the template are simply ignored (that is: extractmpl does a one-way search of the features contained in the template).

In Template preparation we outline two important details which are a consequence of these rules.

We instruct to crop the image to the area surrounding the checkboxes: we want the text nearby the checkbox to match exactly a single time in the page. Text surrounding the checkboxes is generally unique, while checkboxes by themselves are not. Cropping the template exactly around a single checkbox would match any checkbox in the page.

The template must be the smallest, visually unique area in the page (and more generally, unique in all the possible pages to be discerned).

Once an appropriate template region has been located, we want to ignore details which might change inside the page. As a consequence of the one-way match, we can simply remove any detail from the template that we want to ignore: if the page has been scribbled, simply remove any pen mark using an image editor. If there's a variable code in the header of a page that we want to ignore, just clear the code from the template.

This behavior however has the unintended side-effect that a page can potentially contain unwanted additional features in any empty area of the template and still match. In such cases it's then necessary to enlarge the template to include some other unique feature. In practice though, such scenario is very unlikely: careful template preparation becomes an issue only when similar pages are being analyzed, and it's otherwise a trivial process.

extractmpl looks for the template in the entire page, allowing for a certain degree of deformation. Searching for the template is the slowest process by a large margin, but it can be sped-up by restricting the search area to a smaller region using the -r [region] flag. The specified region, specified in pixel coordinates from the top-left corner of the page, must always contain the template completely. As such, provide for a generous border around the expected position to account for all page shifts and distortions caused by the printer+scanner combination.

extractmpl will output the rectified area matching the template inside the page. That is, the output image will contain the checkboxes of each page at exactly the same coordinates as the template.

You can process the output directly yourself, or use simpleomr immediately in sequence. simpleomr contains ad-hoc logic to handle square checkboxes, with support for both simple marks, ticks, X-es, and filling.

The checkbox positions can be constructed conveniently using Inkscape. Create a new document, and import a single image: the template itself. Draw a single rectangle over each checkbox. Use the rectangle tool only: do not use any other tool, group or transform. Guides can be used for alignment.

When drawing the rectangle, you should go around the checkbox borders, leaving some internal padding (30-50% of the checkbox size is a good rule). simpleomr ignores perfectly orthogonal lines, effectively masking the checkbox borders automatically and accounting for any scribble going outside the designated checkbox area (which is very common).

The rectangles in the same template should preferably have all the same sizes for best results.

Assign an unique ID to each rectangle (right click -> Object properties ... -> ID) that will become the key in the output table. Save the resulting file as an Inkscape SVG file or as a Plain SVG file.

For each checkbox in the SVG file simpleomr will output a simple tab-separated table of the form:

ID value
ID value
... ...

Where each ID is assigned to a rectangle via Inkscape, and value can be:

-1:Unknown state
0:Empty
1:Checked
2:Filled

simpleomr can also output an additional debugging image using the -d [file] flag. Such image shows how each pixel/location is considered by simpleomr:

Blue areas:pixel constituent
Yellow rect:unknown
Green rect:empty
Red rect:checked
Gray rect:filled
Other pixels:ignored

simpleomr has no notion of the checkbox semantics in the template. As such, it simply returns the fill status (empty/checked/filled) of each box. When pairs of checkboxes are used, as typical in a yes/no scenario, post-processing is often required for validating the true answer.

RescueOMR can be found at https://www.thregr.org/wavexx/software/RescueOMR/

RescueOMR is distributed under GNU AGPLv3+, WITHOUT ANY WARRANTY.
Copyright(c) 2016-2017: Yuri D'Elia <[email protected]>
Copyright(c) 2016-2017: EURAC, Institute of Genetic Medicine

RescueOMR's GIT repository is publicly accessible at:

https://github.com/EuracBiomedicalResearch/RescueOMR