Invalid structMap produced #1195
Much simpler way to reproduce:
export OCRD_METS_CACHING=1
git clone https://github.com/OCR-D/assets
cd assets/data/SBB0000F29300010000/data
ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default-2021-03-09
eval declare -A XSD_PATHS=($(ocrd bashlib constants XSD_PATHS))
XSD_METS=${XSD_PATHS[$(ocrd bashlib constants XSD_METS_URL)]}
xmllint --schema $XSD_METS --noout mets.xml
The culprit is the caching. |
Alright, so it's not ocrd-cis, but definitely a bug in core and a severe one! |
https://qurator-data.de/~mike.gerber/2024-02-core-issue-1195-invalid-structMap/
|
Workaround when running the ocrd/all:maximum image is disabling the caching by setting
e.g. by using the
(I'm currently trying to use the official image and this shell is what I am using for the time being.) |
Setting OCRD_METS_CACHING=0 fixes this, the structMap is valid after this binarization run. |
Note: As I understand it, this only happens with the ocrd/all Docker images, because they enable caching, but (unconfirmed) not when running the naked un-containerized ocrd_all. |
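For anyone who scripts their container runs, here is a minimal sketch of the workaround: it builds a `docker run` command that passes `OCRD_METS_CACHING=0` into the container. The helper name `docker_cmd_without_caching` is made up for illustration, and the `/data` mount point and working directory are assumptions about how the image is run locally, not something stated in this thread.

```python
import shlex
from pathlib import Path


def docker_cmd_without_caching(workdir, processor_args, image="ocrd/all:maximum"):
    """Build a `docker run` command line with METS caching disabled.

    Hypothetical helper: the /data mount point is an assumption about the
    local setup, adjust it to however you normally run the image.
    """
    workdir = Path(workdir).resolve()
    return [
        "docker", "run", "--rm",
        "-e", "OCRD_METS_CACHING=0",   # the workaround from this thread
        "-v", f"{workdir}:/data",      # assumed mount point
        "-w", "/data",                 # assumed working directory
        image,
        *processor_args,
    ]


if __name__ == "__main__":
    cmd = docker_cmd_without_caching(
        ".",
        ["ocrd-sbb-binarize", "-I", "OCR-D-IMG", "-O", "OCR-D-BIN",
         "-P", "model", "default-2021-03-09"],
    )
    # Print the command for inspection; pass `cmd` to subprocess.run(cmd, check=True)
    # to actually execute it.
    print(shlex.join(cmd))
```

Building the command as a list keeps the environment override explicit and avoids shell quoting problems.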
I found the cause: In this line … core/src/ocrd_models/ocrd_mets.py Line 779 in 3a1b3a2
setdefault ).
|
See new #1193 |
Note: we should also add test coverage for this. AFAICS, we need a cached |
Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this? |
Yes, OCR-D/ocrd_all#407
|
Does this test a simple workflow? I really don't see it? |
Note: as long as no v2.62.3 is released on PyPI, the bug is still not solved for users (ocrd/all being affected). |
This might be a bug in ocrd-cis actually, so beware.
We encountered a number of problems elsewhere due to an invalid physical structMap. Here, I managed to reproduce it with the latest ocrd/all:maximum Docker image, with the following steps:
1. ocrd workspace remove-group -rf → After this, the structMap is OK!
2. ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN → After this, the structMap is INVALID

Invalid structMap (multiple divs with same ID) after step 2, shortened to one physical page for emphasis:
(I'll upload the full data in the comments)
This causes all kinds of breakage all over the place.
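A quick way to spot the symptom without a full schema validation is to count the IDs of the divs in the physical structMap. The following is a minimal sketch using only the Python standard library; the function name `duplicate_div_ids` and the sample IDs (`PHYS_0001`, `PHYS_0002`) are made up for illustration and are not taken from the affected workspace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS_NS = "http://www.loc.gov/METS/"


def duplicate_div_ids(root):
    """Return the sorted IDs used by more than one mets:div in the physical structMap."""
    counts = Counter()
    for struct_map in root.iter(f"{{{METS_NS}}}structMap"):
        if struct_map.get("TYPE") != "PHYSICAL":
            continue
        for div in struct_map.iter(f"{{{METS_NS}}}div"):
            div_id = div.get("ID")
            if div_id is not None:
                counts[div_id] += 1
    return sorted(div_id for div_id, n in counts.items() if n > 1)


# Hypothetical, heavily shortened METS fragment with one duplicated page div.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001"/>
      <mets:div TYPE="page" ID="PHYS_0001"/>
      <mets:div TYPE="page" ID="PHYS_0002"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

if __name__ == "__main__":
    print(duplicate_div_ids(ET.fromstring(SAMPLE)))
```

To check a real workspace, pass `ET.parse("mets.xml").getroot()` instead of the sample. An empty list means no div ID is repeated in the physical structMap; it does not replace the xmllint schema validation.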
What I didn't check yet: whether this only breaks with ocrd_cis; maybe @bertsky can share his debugging efforts here. I first had the impression that this breaks with add too, but since I was trying to reproduce a problem encountered by @stweil in OCR-D/quiver-benchmarks#22, it could have been ocrd_cis all along (that specific workflow uses it as its first step), and I could easily have confused something.