Converting from PAGE to hocr creates double results #34

Moarc · 2021-04-21T11:47:18Z

hOCR files converted from PAGE contain every TextEquiv, as opposed to one variant and, for fontshape, the style determined by fontshape.

I start with an empty workspace, add an image to it, and run
ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model pol -I images -O OCR-D-OCR"
then I annotate it with
ocrd-tesserocr-fontshape -I OCR-D-OCR -O OCR-D-OCR-FONTSHAPE -P model pol
and finally, convert it to hocr
ocrd-fileformat-transform -I OCR-D-OCR-FONTSHAPE -O hocr -P from-to "page hocr"

The resulting file has the words/segments doubled and, when fontshape is used, tripled.
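To check whether the PAGE input already carries several TextEquiv entries per word before conversion, a minimal sketch like the following could be used. It relies only on the Python standard library and ignores the XML namespace. The `SAMPLE` document, its ids, and the helper names `local_name` and `count_text_equivs` are hypothetical and only serve as illustration; they are not part of any OCR-D tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened PAGE-like document for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1">
          <TextEquiv index="1"><Unicode>foo</Unicode></TextEquiv>
          <TextEquiv index="2"><Unicode>fo0</Unicode></TextEquiv>
        </Word>
        <Word id="w2">
          <TextEquiv><Unicode>bar</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_text_equivs(page_xml, level="Word"):
    """Map the id of each element of the given level to its number of direct TextEquiv children."""
    root = ET.fromstring(page_xml)
    counts = {}
    for elem in root.iter():
        if local_name(elem.tag) == level:
            counts[elem.get("id")] = sum(
                1 for child in elem if local_name(child.tag) == "TextEquiv"
            )
    return counts


if __name__ == "__main__":
    for elem_id, n in count_text_equivs(SAMPLE).items():
        marker = "  <- more than one TextEquiv" if n > 1 else ""
        print(f"{elem_id}: {n}{marker}")
```

Passing the contents of a PAGE file from the OCR-D-OCR-FONTSHAPE file group instead of `SAMPLE` would show which words carry more than one TextEquiv. If the counts match the duplication seen in the hOCR output, that would suggest the converter emits every alternative rather than picking one.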


kba added the bug label Aug 31, 2021

kba self-assigned this Dec 12, 2022
