Workspace Bulk Add
Let's assume you have a directory path/to/files/PAGE containing PAGE files and a directory path/to/files/IMG containing images. The files have basenames like page_0001.xml, page_0001.tif, etc.
ocrd workspace bulk-add \
--regex '^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$' \
--file-id 'FILE_{{ fileGrp }}_{{ pageid }}' \
--page-id 'PHYS_{{ pageid }}' \
--file-grp "{{ fileGrp }}" \
--url '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}' \
'path/to/files/*/*.*'
This first expands the glob to get the filenames and resolves them to absolute paths. Every path is then matched against --regex with re.match, yielding template variables derived from the structure of the path. These template variables can be used in all file-specific options. After expansion, --url is used as the filename relative to the workspace directory, and the file is copied into the workspace if it is not already present. Once all template variables are expanded, the file is added with Workspace.add_file.
In this case, path/to/files/PAGE/page_0001.xml is resolved to:
- url: PAGE/FILE_0001.xml (will be copied because the file name is different)
- fileGrp: PAGE
- ID: FILE_PAGE_0001
- pageId: PHYS_0001
--mimetype, if not provided, is mapped from the file extension.
--ignore disables the check for existing files with the same @ID, which is a huge performance boost.
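The matching and template expansion described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual ocrd implementation: the helper names expand and resolve are hypothetical, and the placeholder substitution is a simple stand-in for whatever templating the tool uses internally. The regex and templates are copied from the command above.

```python
import re

# Regex and templates copied from the bulk-add command above.
REGEX = r'^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$'
TEMPLATES = {
    'file_id': 'FILE_{{ fileGrp }}_{{ pageid }}',
    'page_id': 'PHYS_{{ pageid }}',
    'file_grp': '{{ fileGrp }}',
    'url': '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}',
}


def expand(template, variables):
    """Replace every '{{ name }}' placeholder with the captured group value.

    Hypothetical stand-in for the tool's own template expansion.
    """
    return re.sub(r'\{\{\s*(\w+)\s*\}\}',
                  lambda m: variables[m.group(1)], template)


def resolve(path):
    """Match one path against REGEX and expand all templates (hypothetical helper)."""
    match = re.match(REGEX, path)
    if match is None:
        return None  # path does not fit the naming convention
    variables = match.groupdict()
    return {key: expand(tpl, variables) for key, tpl in TEMPLATES.items()}


# '/abs' stands for whatever absolute prefix the glob resolves to.
result = resolve('/abs/path/to/files/PAGE/page_0001.xml')
for key, value in result.items():
    print(f'{key}: {value}')
```

Running a sketch like this against a handful of sample paths before the real import is a cheap way to check that the named groups capture what you expect.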
If the FILE_GLOB is a single dash (-), the list of file paths is read from STDIN, so you can pass in data about the files to be added as a simple list of space-separated values, one file per line:
{ echo PHYS_0001 BIN FILE_0001_BIN.IMG-wolf BIN/FILE_0001_BIN.IMG-wolf.png; \
echo PHYS_0001 BIN FILE_0001_BIN BIN/FILE_0001_BIN.xml; \
echo PHYS_0002 BIN FILE_0002_BIN.IMG-wolf BIN/FILE_0002_BIN.IMG-wolf.png; \
echo PHYS_0002 BIN FILE_0002_BIN BIN/FILE_0002_BIN.xml; \
} | ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<filegrp>.*) (?P<fileid>.*) (?P<url>.*)' \
-G '{{ filegrp }}' -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ url }}' -
This allows users to prepare the data to be added semi-manually, for example as a CSV-like file. It works particularly well when the naming convention of the files is not consistent or informative enough to rely on pattern matching of filenames alone.
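As a rough sketch of that semi-manual route, the following Python snippet turns a comma-separated table into the space-separated lines expected on STDIN, and checks each line against the regex from the command above. The column names (pageid, filegrp, fileid, url) and the function name csv_to_bulk_lines are assumptions made for this illustration; they are not prescribed by ocrd.

```python
import csv
import io
import re

# Regex copied from the STDIN example above.
STDIN_REGEX = r'(?P<pageid>.*) (?P<filegrp>.*) (?P<fileid>.*) (?P<url>.*)'
COLUMNS = ['pageid', 'filegrp', 'fileid', 'url']  # assumed CSV header names


def csv_to_bulk_lines(csv_text):
    """Convert CSV rows into space-separated lines, one file per line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = [row[name] for name in COLUMNS]
        # The regex splits on single spaces, so no value may contain one.
        if any(' ' in field for field in fields):
            raise ValueError(f'space inside a value: {fields}')
        line = ' '.join(fields)
        # Round-trip check: the regex must recover exactly the CSV values.
        assert re.match(STDIN_REGEX, line).groupdict() == dict(zip(COLUMNS, fields))
        lines.append(line)
    return lines


EXAMPLE_CSV = """pageid,filegrp,fileid,url
PHYS_0001,BIN,FILE_0001_BIN.IMG-wolf,BIN/FILE_0001_BIN.IMG-wolf.png
PHYS_0001,BIN,FILE_0001_BIN,BIN/FILE_0001_BIN.xml
"""

bulk_lines = csv_to_bulk_lines(EXAMPLE_CSV)
print('\n'.join(bulk_lines))
```

The printed lines could then be piped into the bulk-add call shown above in place of the echo commands.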
For example, to import the old (first-generation zip-file) OCR-D GT directories, one could then do:
# in a directory where all zip-files have been extracted already:
for book in */; do
pushd "$book"
ocrd workspace init
ocrd workspace set-id "$book"
# only images, no copying
ocrd workspace bulk-add \
--skip \
--regex '^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$' \
--file-id 'FILE_ORIG_{{ pageid }}' \
--page-id 'PHYS_{{ pageid }}' \
--file-grp OCR-D-IMG \
--url '{{ dispname }}_{{ pageid }}.tif' \
$(find . -name "*.tif")
# only PAGE, no copying
ocrd workspace bulk-add \
--skip \
--regex '^(?P<dispname>[^/]*)/page/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
--file-id 'FILE_GT_{{ pageid }}' \
--page-id 'PHYS_{{ pageid }}' \
--file-grp OCR-D-GT-SEG-PAGE \
--url 'page/{{ dispname }}_{{ pageid }}.xml' \
$(find . -name "*.xml")
# only ALTO, no copying
ocrd workspace bulk-add \
--skip \
--regex '^(?P<dispname>[^/]*)/alto/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
--file-id 'FILE_GT-ALTO_{{ pageid }}' \
--page-id 'PHYS_{{ pageid }}' \
--file-grp OCR-D-GT-ALTO-SEG-PAGE \
--mimetype application/alto+xml \
--url 'alto/{{ dispname }}_{{ pageid }}.xml' \
$(find . -name "*.xml")
popd
done
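(In this convention the images do not live in a subdirectory of their own, so a single regex cannot capture the fileGrp directly from the path. Breaking the import up into one call per file type works around this and allows a basic form of string transformation.)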
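The image regex in the loop above relies on a named backreference, (?P=dispname), to require that the file name starts with the name of its directory. The Python sketch below shows how that selects image files and leaves everything else unmatched. The sample paths and the helper name classify are made up for illustration, and the sketch assumes that paths reach the regex in relative form without a leading ./ (how ocrd normalizes them is not shown here).

```python
import re

# Image regex copied from the first bulk-add call in the loop above.
IMG_REGEX = r'^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$'


def classify(paths):
    """Split paths into matched (with captured groups) and unmatched ones."""
    matched, unmatched = {}, []
    for path in paths:
        match = re.match(IMG_REGEX, path)
        if match:
            matched[path] = match.groupdict()
        else:
            unmatched.append(path)
    return matched, unmatched


# Hypothetical directory contents following the old GT layout.
sample_paths = [
    'book1/book1_0001.tif',       # directory and file prefix agree
    'book1/other_0001.tif',       # prefix differs from directory name
    'book1/page/book1_0001.xml',  # PAGE file, handled by a separate call
]

matched, unmatched = classify(sample_paths)
print(matched)
print(unmatched)
```

Only the first path matches; the other two are the kind of input that the separate calls, each with its own regex and hard-coded file group, are meant to pick up or leave out.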
In the common case where images and annotations reside in per-document directories, with image files alongside PAGE-XML files of the same basename (as in the old LAREX bookpath convention, or in various GT collections), the following imports such books into (OCR-D conforming) METS without copying the files into new (OCR-D conforming) paths:
# in the bookpath/library directory:
for book in */; do
pushd "$book"
ocrd workspace init
ocrd workspace set-id "$book"
ocrd workspace bulk-add \
--regex '^(?P<pageid>.*)\.xml$' \
--file-id 'OCR-D-GT-SEG-LINE_{{ pageid }}' \
--page-id 'PHYS_{{ pageid }}' \
--file-grp OCR-D-GT-SEG-LINE \
--url '{{ pageid }}.xml' \
$(find . -name "*.xml" -not -name mets.xml)
ocrd workspace bulk-add \
--regex '^(?P<pageid>.*)\.(?P<ext>[^.]*)$' \
--file-id 'OCR-D-IMG_{{ pageid }}' \
--page-id 'PHYS_{{ pageid }}' \
--file-grp OCR-D-IMG \
--url '{{ pageid }}.{{ ext }}' \
$(find . -type f -not -name "*.xml")
popd
done
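The image regex in the second call splits each file name at its last dot, because the greedy pageid group takes as much as it can while ext may not contain a dot. Here is a small Python check of that behaviour, assuming the named group is written as (?P<ext>...); the function name split_name and the sample file names are hypothetical.

```python
import re

# Image regex from the second bulk-add call, with the named group 'ext'.
IMG_REGEX = r'^(?P<pageid>.*)\.(?P<ext>[^.]*)$'


def split_name(filename):
    """Return (pageid, ext) for a file name, or None if it has no dot."""
    match = re.match(IMG_REGEX, filename)
    if match is None:
        return None
    return match.group('pageid'), match.group('ext')


for name in ['0001.tif', '0001.bin.png', 'README']:
    print(name, '->', split_name(name))
```

A name with several dots, such as 0001.bin.png, therefore keeps everything before the last dot as its pageid, which is worth knowing when intermediate images sit next to the originals.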