Use better tesseract training dataset #459

jonchang · 2024-12-05T19:47:50Z

Description

Uses a better tesseract training dataset.

By default, Debian (and Ubuntu) package the tessdata_fast training data in their repositories: https://github.com/AlexanderP/tesseract-lang-debian/blob/HEAD/debian/upstream/metadata

This pull request downloads the tessdata_best dataset instead, which is larger (5 MB -> 12 MB) and roughly 2x slower (0.75 s -> 1.5 s), but in practice this isn't a big deal.
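As a rough illustration of the download step, here is a minimal Python sketch. It is not the code from this PR: the URL pattern (the upstream `tesseract-ocr/tessdata_best` repository on its `main` branch) and the helper names `traineddata_url` and `download_traineddata` are assumptions made for the example.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumed upstream location of the "best" models; not taken from this PR's diff.
TESSDATA_BEST_URL = (
    "https://github.com/tesseract-ocr/tessdata_best/raw/main/{lang}.traineddata"
)


def traineddata_url(lang: str = "eng") -> str:
    """Build the (assumed) download URL for one language model."""
    return TESSDATA_BEST_URL.format(lang=lang)


def download_traineddata(dest_dir, lang: str = "eng", opener=urlopen) -> Path:
    """Fetch <lang>.traineddata into dest_dir and return the written path.

    `opener` is injectable so the function can be exercised without network access.
    """
    dest = Path(dest_dir) / f"{lang}.traineddata"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with opener(traineddata_url(lang)) as response:
        dest.write_bytes(response.read())
    return dest
```

Tesseract can then be pointed at the download directory, for example through the `TESSDATA_PREFIX` environment variable or the `--tessdata-dir` command-line option.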

Also updates to Tesseract v5 and makes some changes to slim down the Docker image.

Related Issues

Part of #422. These changes were used locally to generate benchmark metrics for the comparison, so this pull request ports them to the deployed image.
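The runtime side of such a local comparison could be measured along these lines. This is a hedged sketch, not the benchmark used for #422: `median_runtime` and `slowdown` are hypothetical helpers, and `run_ocr` stands for whatever callable performs a single OCR pass.

```python
import statistics
import time
from typing import Callable


def median_runtime(run_ocr: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` calls to run_ocr."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_ocr()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(fast_seconds: float, best_seconds: float) -> float:
    """Ratio of the 'best' model's runtime to the 'fast' model's runtime."""
    return best_seconds / fast_seconds
```

With the timings quoted in the description, `slowdown(0.75, 1.5)` gives `2.0`, matching the roughly 2x penalty. In practice `run_ocr` might wrap a call such as `pytesseract.image_to_string(image, config="--tessdata-dir <path>")`, once per model directory.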

Also part of #412. Moving to python-slim (and cleaning up Poetry caches) dropped the base image size from 3 GB to 2 GB.

Checklist

The title of this PR is descriptive and concise.
My changes follow the style guidelines of this project.
I have added or updated test cases to cover my changes.
I've let the team know about this PR by linking it in the review channel.

This should improve OCR performance using tesseract with only a mild increase in runtime and container size.

arinkulshi-skylight

Awesome, looks good!

This was referenced Dec 5, 2024

Generate metrics for comparison between tesseract and tr-ocr #422

Closed

Reduce the size of the OCR Docker image #412

Open

jonchang force-pushed the use-better-tesseract-data branch 2 times, most recently from 599cca6 to 90e730c December 6, 2024 21:12

jonchang added 2 commits December 6, 2024 14:35

Download tessdata-best instead of tessdata-fast

0292dfe

This should improve OCR performance using tesseract with only a mild increase in runtime and container size.

Drop unused cdifflib

b888f33

jonchang force-pushed the use-better-tesseract-data branch from 164cb20 to b888f33 December 6, 2024 22:35

jonchang marked this pull request as ready for review December 6, 2024 23:33

arinkulshi-skylight self-requested a review December 9, 2024 17:27

arinkulshi-skylight approved these changes Dec 9, 2024

jonchang added this pull request to the merge queue Dec 9, 2024

Merged via the queue into main with commit c8d2b39 Dec 9, 2024
2 checks passed

jonchang deleted the use-better-tesseract-data branch December 9, 2024 19:03
