Benchmark workflows "selected_pages_ocr" do not produce text results #22
Comments
Would be interesting to see the files of
Here is
The QuiVer benchmark workflow selected_pages_ocr uses a process which binarizes twice. That gives an image which is too light for good OCR results (some characters are even missing completely). Nevertheless, most of the text is still readable, so there should be some OCR result.
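One way to quantify "too light" is the foreground-pixel ratio of the binarized page image. The following is only an illustrative sketch, not part of QuiVer; the file path is a placeholder:

```python
# Illustrative only: estimate how "light" a binarized page image is by its
# share of foreground (black) pixels. The path is a placeholder.
import numpy as np
from PIL import Image

def foreground_ratio(path):
    """Fraction of black pixels in a binarized (black-on-white) image."""
    img = np.array(Image.open(path).convert("1"))  # boolean array, True = white
    return 1.0 - img.mean()

# A page that went through binarization twice should show a noticeably lower
# ratio than the singly binarized version of the same page.
print(foreground_ratio("page_0001_binarized.png"))
```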
All data is now available online. It also includes the generated page images, for example page 1 (binarized twice, denoised, deskewed).
There are no TextLines to recognize text from, so this is expected.
(I'm going on vacation in 2 hours so I'm not checking where the segmentation step is missing/going wrong, but I can check when I'm back)
Commit 3b32589 removed a parameter. If that parameter is added again, some tests work fine, but others fail with a runtime error.
Meanwhile I restored the line segmentation for the workflow and got OCR results, at least for the tests where the segmentation process did not crash (see cisocrgroup/ocrd_cis#94). It looks like the segmentation of a single newspaper page takes several hours (the first one has now been running for 252 minutes, see cisocrgroup/ocrd_cis#98). I am afraid that the whole workflow cannot be used in the benchmark tests because of that.
The workflow selected_pages_ocr uses more than 118 GiB of RAM while running OCR with ocrd-calamari-recognize.
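For reference, a figure like this can be reproduced by recording the peak RSS of the processor's process tree. A minimal sketch, assuming a Linux host and a plain CLI invocation; the command shown is only a placeholder:

```python
# Record the peak resident set size of a child process tree (Linux: KiB).
import resource
import subprocess

def peak_rss_gib(cmd):
    """Run a command and return the peak RSS of its children in GiB."""
    subprocess.run(cmd, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024**2

# Placeholder invocation; a real run would include the workspace and file groups.
print(peak_rss_gib(["ocrd-calamari-recognize", "--help"]))
```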
That change is faulty btw: default is
On which input data is that, specifically? Your upload does not seem to be up to date. I'd gladly reproduce and debug if I had the workspace, including the segmentation used. The "configuration" used is workflows/ocrd_workflows/selected_pages_ocr.txt, I take it?
That's right, selected_pages_ocr.txt is the workflow file.
It would help if I had a workspace up to that point. From the looks of it, @bertsky seems to be right (above) and the workflow still doesn't produce line segmentation (only region segmentation), so this behaviour would be even more curious.
@stweil didn't we already establish (in the OCR-D Forum) that the version of ocrd_all used by Quiver at the time was hopelessly outdated? But I agree we should get to the bottom of this – with or without line segments, ocrd-calamari-recognize should not be allowed (or motivated) to allocate large amounts of memory.
Yep. The way it works (line-by-line processing), it shouldn't happen, but (a) I didn't test many newspaper pages myself and did that on a host with a lot of memory, and (b) it wouldn't be the first time I've seen a memory leak with TensorFlow.
(Should probably run processors with ulimit or in a cgroup)
Agreed! Could also be easily done in ocrd_all Docker images. Docker itself offers options like --memory.
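To illustrate the ulimit/cgroup idea: a processor invocation could be wrapped so that the child process inherits a hard virtual-memory cap, roughly what `ulimit -v` does. This is only a sketch, not something ocrd_all currently does; the command line, output file group and the 16 GiB cap are placeholders:

```python
# Sketch: run an OCR-D processor CLI under a hard virtual-memory limit,
# roughly equivalent to starting it from a shell after `ulimit -v`.
import resource
import subprocess

def run_with_memory_limit(cmd, max_bytes):
    """Run a command with RLIMIT_AS capped in the child process."""
    def limit():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(cmd, preexec_fn=limit, check=False)

# Placeholder invocation with a 16 GiB cap; exceeding it makes allocations
# fail instead of exhausting the host's RAM.
run_with_memory_limit(
    ["ocrd-calamari-recognize",
     "-I", "OCR-D-SEG-LINE-RESEG-DEWARP",  # file group name taken from this thread
     "-O", "OCR-D-OCR"],                   # placeholder output file group
    16 * 1024**3,
)
```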
I have thoughts about this (for example, I don't think profile.d would work here), should we open an issue in ocrd_all then? Have to look into the "slim image" efforts anyway.
It is still outdated, see issue #23. And I don't know whether there are plans and resources to change that.
@mikegerber I added it to OCR-D/ocrd_all#280 – please add your ideas there.
Because I didn't have the workspace to debug the memory problem involving ocrd-calamari-recognize, I tried to re-run the workflow steps myself.
Workspace at this point, if someone wants to have a look: https://qurator-data.de/~mike.gerber/2024-02-quiver-benchmarks-issue-22/reichsanzeiger_random_selected_pages_ocr.zip
At this point, I am not willing to look into this specific ocrd-calamari-recognize memory issue further, because I can't reproduce anything properly; it already involved guessing which original workspace it could have been and trying to run 7 processors. I am willing to look into it further if I get the workspace in the state before ocrd-calamari-recognize ran, including OCR-D-SEG-LINE-RESEG-DEWARP. I'll test with some other segmentation in OCR-D/ocrd_calamari#110, just to make sure that there is no general issue.
I am not sure that the images are binarized twice. The workflow runs binarization twice, yes, but the second binarization step may just use the original image (cropped), via AlternativeImage. @kba @bertsky Is this correct? Is there a way to verify this with the log? (In the ZIP in the comment above this.)
@mikegerber exactly. All binarization processors filter out already binarized images on the input side (via their image feature selection), so the second step binarizes the cropped original rather than the output of the first step. The log would only detail this if you were to enable the respective debug loggers.
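For reference, a sketch of how that input-side filtering looks with OCR-D core's workspace API, as far as I understand it; the file group and METS path are placeholders:

```python
# Sketch of input-side image selection as done by binarization processors,
# using ocrd core's Workspace.image_from_page. Names below are placeholders.
from ocrd import Resolver
from ocrd_modelfactory import page_from_file

workspace = Resolver().workspace_from_url("mets.xml")
input_file = list(workspace.mets.find_files(fileGrp="OCR-D-IMG-CROP"))[0]
pcgts = page_from_file(workspace.download_file(input_file))
page = pcgts.get_Page()

# Request the page image while *excluding* anything already binarized:
# the processor therefore works on the cropped/deskewed original, not on
# the output of an earlier binarization step.
page_image, page_coords, _ = workspace.image_from_page(
    page, input_file.pageId,
    feature_filter="binarized",
)
```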
The related workflows all end with CER / WER 1.0, so no text is recognized by Calamari.
A manual run for a single GT page terminates in less than 1 second without an error message, but also without a usable result.
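Such a manual single-page run presumably looks roughly like the following (a hypothetical reconstruction; the actual command and its output were not captured here, and file group, page ID and output group are placeholders):

```python
# Hypothetical reconstruction of a manual single-page run of the recognizer.
import subprocess

subprocess.run(
    [
        "ocrd-calamari-recognize",
        "-m", "mets.xml",                     # METS of the benchmark workspace
        "-I", "OCR-D-SEG-LINE-RESEG-DEWARP",  # input file group named in this thread
        "-O", "OCR-D-OCR-CALAMARI",           # placeholder output file group
        "-g", "PHYS_0001",                    # restrict processing to one page
        # a model/checkpoint parameter would additionally be required via -P
    ],
    check=True,
)
```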