Use better tesseract training dataset #459

Merged: 2 commits, Dec 9, 2024

@jonchang (Collaborator) commented Dec 5, 2024

Description

Uses a better tesseract training dataset.

By default, the Debian (and Ubuntu) packages ship the tessdata_fast training data: https://github.com/AlexanderP/tesseract-lang-debian/blob/HEAD/debian/upstream/metadata

This pull request downloads the tessdata_best dataset instead. It is slightly larger and roughly 2x slower at runtime, but in practice this isn't a big deal (5 MB -> 12 MB, 0.75 s -> 1.5 s).
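
For illustration only, here is a minimal Python sketch of fetching a model from the upstream tessdata_best repository. It is not the code in this PR: the helper names, the `eng` default, the `main` branch in the URL, and the destination directory are all assumptions.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Upstream model repositories published by the tesseract-ocr project.
TESSDATA_REPOS = {
    "fast": "https://github.com/tesseract-ocr/tessdata_fast",
    "best": "https://github.com/tesseract-ocr/tessdata_best",
}


def traineddata_url(lang: str = "eng", variant: str = "best") -> str:
    """Build the download URL for one language model (hypothetical helper).

    Assumes the model lives at the repository root on the `main` branch.
    """
    if variant not in TESSDATA_REPOS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{TESSDATA_REPOS[variant]}/raw/main/{lang}.traineddata"


def download_traineddata(dest_dir: str, lang: str = "eng", variant: str = "best") -> Path:
    """Download one model into dest_dir and return the path of the saved file."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{lang}.traineddata"
    urlretrieve(traineddata_url(lang, variant), str(target))
    return target
```

Calling `download_traineddata("/opt/tessdata")` (an example path) would save `eng.traineddata` there. Tesseract then needs to be pointed at that directory, typically through the `TESSDATA_PREFIX` environment variable.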

Also updates to Tesseract v5 and makes some changes to slim down the Docker image.

Related Issues

Part of #422. The better dataset was used locally to generate benchmark metrics for comparison; this pull request ports those changes to the deployed image.

Also part of #412. Moving to python-slim (and cleaning up poetry caches) dropped the base image size from 3 GB to 2 GB.

Checklist

  • The title of this PR is descriptive and concise.
  • My changes follow the style guidelines of this project.
  • I have added or updated test cases to cover my changes.
  • I've let the team know about this PR by linking it in the review channel.

This should improve OCR accuracy using tesseract with only a mild
increase in runtime and container size.
@jonchang force-pushed the use-better-tesseract-data branch from 164cb20 to b888f33 on December 6, 2024 22:35
@jonchang marked this pull request as ready for review on December 6, 2024 23:33

@arinkulshi-skylight (Collaborator) left a comment

Awesome, looks good!

@jonchang added this pull request to the merge queue on Dec 9, 2024
Merged via the queue into main with commit c8d2b39 on Dec 9, 2024
2 checks passed
@jonchang deleted the use-better-tesseract-data branch on December 9, 2024 19:03