Converting from PAGE to hocr creates double results #34

Open
Moarc opened this issue Apr 21, 2021 · 0 comments
Assignees
Labels
bug Something isn't working


Moarc commented Apr 21, 2021

hOCR files converted from PAGE contain every TextEquiv of an element, as opposed to just one variant (and, when fontshape is used, the style determined by fontshape).

I start with an empty workspace, add an image to it, and run
ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model pol -I images -O OCR-D-OCR"
then I annotate it with
ocrd-tesserocr-fontshape -I OCR-D-OCR -O OCR-D-OCR-FONTSHAPE -P model pol
and finally convert it to hOCR with
ocrd-fileformat-transform -I OCR-D-OCR-FONTSHAPE -O hocr -P from-to "page hocr"

The resulting hOCR file has the words/segments doubled, and tripled when fontshape is used.
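
To help narrow this down, here is a minimal diagnostic sketch that counts how many TextEquiv children each Word carries in a PAGE file, on the assumption that the duplicated hOCR entries correspond to multiple TextEquiv elements per Word. The function names `local_name` and `textequiv_counts` are hypothetical helpers, and the sample XML is hand-written and heavily simplified (no Coords, no metadata), not real output from the tools above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def textequiv_counts(page_xml):
    """Map each Word id to the number of its direct TextEquiv children."""
    root = ET.fromstring(page_xml)
    counts = {}
    for elem in root.iter():
        if local_name(elem.tag) == "Word":
            n = sum(1 for child in elem if local_name(child.tag) == "TextEquiv")
            counts[elem.get("id")] = n
    return counts


# Hand-written, simplified sample (assumption): w1 has two text variants.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1">
          <TextEquiv index="1"><Unicode>foo</Unicode></TextEquiv>
          <TextEquiv index="2"><Unicode>fo0</Unicode></TextEquiv>
        </Word>
        <Word id="w2">
          <TextEquiv><Unicode>bar</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for word_id, n in textequiv_counts(SAMPLE).items():
        print(word_id, n)
```

If Words in the OCR-D-OCR-FONTSHAPE file report more than one TextEquiv, that would suggest the converter emits all of them instead of picking a single variant; if they report exactly one, the duplication is more likely introduced elsewhere in the conversion.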

@kba kba added the bug Something isn't working label Aug 31, 2021
@kba kba self-assigned this Dec 12, 2022