Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid structMap produced #1195

Closed
mikegerber opened this issue Mar 1, 2024 · 15 comments · Fixed by #1193
Closed

Invalid structMap produced #1195

mikegerber opened this issue Mar 1, 2024 · 15 comments · Fixed by #1193

Comments

@mikegerber
Copy link
Contributor

mikegerber commented Mar 1, 2024

This might be a bug in ocrd-cis actually, so beware.

We encountered a number of problems elsewhere due to an invalid physical structMap. Here, I managed to reproduce with the latest ocrd:all/maximum Docker image, with the following steps:

  1. Starting with the workspace here: https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/
  2. I removed all filegroups except OCR-D-IMG and OCR-D-GT-SEG-LINE, using ocrd workspace remove-group -rf. → After this, the structMap is OK!
  3. Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN → After this, the structMap is INVALID

Invalid structMap (multiple divs with same ID) after step 2, shortened to one physical page for emphasis:

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>

       ...

      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
  
       ...

    </mets:div>
  </mets:structMap>

(I'll upload the full data in the comments)

This causes all kind of breakage all over the place.

What I didn't check yet: if this only breaks with ocrd_cis, maybe @bertsky can share his debugging efforts here. I first had the impression that this breaks with add too, but as I had tried to reproduce a problem encountered by @stweil in OCR-D/quiver-benchmarks#22 it could have always been in ocrd_cis (specific workflow uses this as first step) and I could have easily confused something.

@bertsky
Copy link
Collaborator

bertsky commented Mar 1, 2024

Much simpler way to reproduce:

export OCRD_METS_CACHING=1
git clone https://github.com/OCR-D/assets
cd assets/data/SBB0000F29300010000/data
ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default-2021-03-09
eval declare -A XSD_PATHS=($(ocrd bashlib constants XSD_PATHS))
XSD_METS=${XSD_PATHS[$(ocrd bashlib constants XSD_METS_URL)]}
xmllint --schema $XSD_METS --noout mets.xml

The culprit is the caching.

@mikegerber
Copy link
Contributor Author

Alright, so it's not ocrd-cis, but definitely a bug in core and a severe one!

@mikegerber
Copy link
Contributor Author

(I'll upload the full data in the comments)

https://qurator-data.de/~mike.gerber/2024-02-core-issue-1195-invalid-structMap/

reichsanzeiger_random_selected_pages_ocr.2-after-ocrd-cis-ocropy-binarize.zip is the one with broken METS, after step 2 above.

@bertsky
Copy link
Collaborator

bertsky commented Mar 1, 2024

Absolutely.

And #1193 does not help (yet). It must be the changes to OcrdMets around #1063, though.

@bertsky
Copy link
Collaborator

bertsky commented Mar 1, 2024

It must be the changes to OcrdMets around #1063, though.

Confirmed by bisection. Somewhere here we broke the cache validity.

@mikegerber
Copy link
Contributor Author

Workaround when running the ocrd/all:maximum image is disabling the caching by setting OCRD_METS_CACHING=0:

$ env | grep OCRD_METS_CACHING
OCRD_METS_CACHING=0

e.g. by using the docker CLI's --env like this:

$ cat run-ocrd_all-maximum.sh
#!/bin/sh
docker run \
        --user $(id -u) --workdir /data --volume $PWD:/data \
        --volume ocrd-models:/models \
        --env OCRD_METS_CACHING=0 \
        --rm -it \
        ocrd/all:maximum bash

(I'm currently trying to use the official image and this shell is what I am using for the time being.)

@mikegerber
Copy link
Contributor Author

mikegerber commented Mar 1, 2024

  1. Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN → After this, the structMap is INVALID

Setting OCRD_METS_CACHING=0 fixes this, the structMap is valid after this binarization run.

@mikegerber
Copy link
Contributor Author

mikegerber commented Mar 1, 2024

Note: As I understand it, this only happens with the ocrd/all Docker images, because they enable caching, but (unconfirmed) not when running the naked un-containerized ocrd_all.

@bertsky
Copy link
Collaborator

bertsky commented Mar 1, 2024

I found the cause: In this line …

self._fptr_cache[pageId] = {}
… the fptr cache gets deleted if there was nothing in the page cache for that pageId already (instead of a setdefault).

@bertsky
Copy link
Collaborator

bertsky commented Mar 1, 2024

See new #1193

@bertsky
Copy link
Collaborator

bertsky commented Mar 1, 2024

Note: we should also add test coverage for this. AFAICS, we need a cached add_file under the circumstance that a file for that page already exists, but in a different fileGrp.

@mikegerber
Copy link
Contributor Author

Note: we should also add test coverage for this. AFAICS, we need a cached add_file under the circumstance that a file for that page already exists, but in a different fileGrp.

Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this?

@kba
Copy link
Member

kba commented Mar 5, 2024

Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this?

Yes, OCR-D/ocrd_all#407

Note: we should also add test coverage for this. AFAICS, we need a cached add_file under the circumstance that a file for that page already exists, but in a different fileGrp.

@mikegerber
Copy link
Contributor Author

Wouldn't a simple integration test (running a simple bin + seg + ocr workflow) using the ocrd_all images also have caught this?

Yes, OCR-D/ocrd_all#407

Does this test a simple workflow? I really don't see it?

@bertsky
Copy link
Collaborator

bertsky commented Mar 8, 2024

Note: as long as no v2.62.3 is released on PyPI, the bug is still not solved for users (ocrd/all being affected).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants