Segmentation on raw images #144

Draft: wants to merge 3 commits into master

bertsky
Collaborator

@bertsky bertsky commented Aug 24, 2020

No description provided.

filter out binarized images (independent of the workflow), to
improve segmentation quality
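
A minimal sketch of what "filtering out binarized images" could look like, assuming derived images are tracked as (filename, features) pairs where features is a comma-separated string such as "cropped,binarized". The helper `select_image`, the fallback filename, and the sample data are hypothetical and not taken from this changeset; in OCR-D core the analogous mechanism is, as far as I know, the `feature_filter` argument of `Workspace.image_from_page`.

```python
def select_image(alternatives, feature_filter, original="original.tif"):
    """Return the most recent derived image that has none of the filtered features.

    alternatives: list of (filename, features) in processing order (hypothetical layout).
    feature_filter: comma-separated features to avoid, e.g. "binarized".
    Falls back to the (hypothetical) original image if every derived image is filtered.
    """
    unwanted = {f for f in feature_filter.split(",") if f}
    for filename, features in reversed(alternatives):
        present = {f for f in features.split(",") if f}
        if not (present & unwanted):
            return filename
    return original


# Sample data, made up for illustration.
alternatives = [
    ("page_cropped.png", "cropped"),
    ("page_cropped_bin.png", "cropped,binarized"),
    ("page_cropped_bin_deskewed.png", "cropped,binarized,deskewed"),
]
print(select_image(alternatives, "binarized"))  # page_cropped.png
```

With the filter in place, segmentation would see the last non-binarized image regardless of which binarization step the workflow ran earlier.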
@bertsky bertsky marked this pull request as draft August 24, 2020 18:04
@bertsky bertsky requested review from kba and wrznr August 24, 2020 18:04
@bertsky bertsky added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Aug 24, 2020
@codecov

codecov bot commented Aug 24, 2020

Codecov Report

Merging #144 into master will increase coverage by 0.04%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #144      +/-   ##
==========================================
+ Coverage   37.73%   37.77%   +0.04%     
==========================================
  Files           9        9              
  Lines        1023      998      -25     
  Branches      216      212       -4     
==========================================
- Hits          386      377       -9     
+ Misses        565      555      -10     
+ Partials       72       66       -6     
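
Coverage rises even though hits fall, because the line count shrinks faster. A quick check of the arithmetic, assuming coverage is simply hits divided by lines (the function name is illustrative, and the reported 37.77% suggests the bot truncates rather than rounds, which is a guess):

```python
def coverage(hits, lines):
    """Line coverage in percent, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


master = coverage(386, 1023)  # figures from the master column above
branch = coverage(377, 998)   # figures from the #144 column above
print(f"master={master:.4f}% branch={branch:.4f}% delta={branch - master:+.4f}")
```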
| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/crop.py | 13.51% <ø> (+0.78%) | ⬆️ |
| ocrd_tesserocr/segment_line.py | 63.63% <ø> (-8.68%) | ⬇️ |
| ocrd_tesserocr/segment_region.py | 53.64% <ø> (+4.21%) | ⬆️ |
| ocrd_tesserocr/segment_table.py | 0.00% <0.00%> (ø) | |
| ocrd_tesserocr/recognize.py | 47.75% <0.00%> (-1.00%) | ⬇️ |
| ocrd_tesserocr/binarize.py | 22.95% <0.00%> (+1.63%) | ⬆️ |
| ocrd_tesserocr/deskew.py | 17.34% <0.00%> (+1.88%) | ⬆️ |

... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 24b7ced...2b3e8d6.

@bertsky
Collaborator Author

bertsky commented Aug 24, 2020

This needs to be tested systematically. I expect to see both degradation and improvement, depending on how hard binarization is. See here for explanation.

Member

@kba kba left a comment

I understand the reasoning, subscribing to tesseract-ocr/tesseract#3083 for the discussion on upstream changes. Changeset (filtering binarized) is sensible but needs good testing to ensure that it is more beneficial than detrimental, or perhaps should be parameterizable.

@bertsky
Collaborator Author

bertsky commented Aug 25, 2020

> or perhaps should be parameterizable.

I thought about that, but at workflow configuration time, you have next to no chance of knowing which is going to be better. (I would guess that only input images which fare well under global Otsu are better off with the change. But we have no automatic indicator of binarization quality yet. At the very least, we should strive for some estimator based on the local distribution of connected component statistics.)

But I still hope that we can fix the problem in Tesseract itself.
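
One possible reading of the estimator idea above, as a rough sketch only: label the connected components of a binarized image, then compare per-tile statistics, here the share of very small components (speckle), which would be expected to rise where thresholding failed. The function names, the tile size, and the `max_speck_area` cutoff are all hypothetical; nothing in this thread specifies them.

```python
from collections import deque


def components(img):
    """Yield (area, (row, col) of first pixel) for each 4-connected foreground component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, area = deque([(r, c)]), 0
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    area += 1
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                yield area, (r, c)


def speckle_ratio_per_tile(img, tile=4, max_speck_area=1):
    """Map tile index -> fraction of components starting in that tile that are specks."""
    counts = {}
    for area, (r, c) in components(img):
        key = (r // tile, c // tile)
        total, specks = counts.get(key, (0, 0))
        counts[key] = (total + 1, specks + (area <= max_speck_area))
    return {key: specks / total for key, (total, specks) in counts.items()}


# Toy image: a solid blob on the left, isolated pixels on the right.
img = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(speckle_ratio_per_tile(img))  # {(0, 0): 0.0, (0, 1): 1.0}
```

A large spread of such ratios across tiles would hint that a single global threshold did not suit the whole page; whether that correlates with segmentation quality would still need the systematic testing mentioned earlier.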

@bertsky bertsky force-pushed the segment-filter-binarized branch from d231edb to 2b3e8d6 October 1, 2020 23:44