This document describes an exchange format to bundle a workspace described by a METS file following OCR-D's conventions.
METS is the exchange format of choice by OCR-D for describing relations of files such as images and metadata about those images such as PAGE or ALTO files. METS is a textual format, not suitable for embedding arbitrary, potentially binary, data. For various use cases (such as transfer via network, long-term preservation, reproducible tests etc.) it is desirable to have a self-contained representation of a workspace.
With such a representation, data producers are not forced to provide dereferenceable HTTP URLs for the files they produce, and data consumers are not forced to dereference every HTTP URL.
While METS does have mechanisms for embedding XML data and even base64-encoded binary data, the tradeoffs in file size, parsing speed and readability are too great to make this a viable solution for a mass digitization scenario.
Instead, we propose an exchange format ("OCRD-ZIP") based on the BagIt spec, which is used for data ingestion and has been adopted in the web archiving community.
As a baseline, an OCRD-ZIP must adhere to v0.97+ of the BagIt specs, i.e.
- all files in data/
- a file bagit.txt
- a file bag-info.txt
In accordance with the BagIt standard, bagit.txt MUST consist of exactly these two lines:
BagIt-Version: 1.0
Tag-File-Character-Encoding: UTF-8
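A minimal sketch of how an implementation might check this declaration. The function name `is_valid_bagit_txt` is hypothetical, and the tolerance for differing line endings or a missing final newline is an assumption, not something this document prescribes.

```python
# The two lines that bagit.txt must consist of, as stated above.
EXPECTED_BAGIT_TXT = (
    "BagIt-Version: 1.0\n"
    "Tag-File-Character-Encoding: UTF-8\n"
)


def is_valid_bagit_txt(content: str) -> bool:
    """Return True if content consists of exactly the two expected lines.

    Assumption: "\r\n" line endings and a missing trailing newline are accepted.
    """
    return content.splitlines() == EXPECTED_BAGIT_TXT.splitlines()
```

A caller would read bagit.txt as UTF-8 and pass the text to this function before looking at any other file of the bag.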
In addition, an OCRD-ZIP adheres to a BagIt profile (see Appendix A for the full definition):
bag-info.txt MUST additionally contain these tags:
- BagIt-Profile-Identifier: URL of the OCR-D BagIt profile
- Ocrd-Identifier: A globally unique identifier for this bag
- Ocrd-Base-Version-Checksum: Checksum of the version this bag is based on
bag-info.txt MAY additionally contain these tags:
- Ocrd-Mets: Alternative path to the mets.xml file, relative to /data, if its path is not mets.xml
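The tag requirements can be checked with a small parser. The following is a sketch only: `parse_bag_info` and `missing_required_tags` are hypothetical names, the parser handles the simple "Label: value" form with indented continuation lines, and it assumes that a repeated label may simply overwrite the earlier value.

```python
REQUIRED_TAGS = (
    "BagIt-Profile-Identifier",
    "Ocrd-Identifier",
    "Ocrd-Base-Version-Checksum",
)


def parse_bag_info(text: str) -> dict[str, str]:
    """Parse the text of bag-info.txt into a dict of label -> value."""
    tags: dict[str, str] = {}
    last_label = None
    for line in text.splitlines():
        if not line.strip():
            continue
        # Indented lines continue the value of the previous label.
        if line[0] in " \t" and last_label is not None:
            tags[last_label] += " " + line.strip()
            continue
        # Split at the first colon only, so URL values stay intact.
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a tag line: {line!r}")
        last_label = label.strip()
        tags[last_label] = value.strip()
    return tags


def missing_required_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tags that are absent, in the order listed above."""
    return [tag for tag in REQUIRED_TAGS if tag not in tags]
```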
The BagIt-Profile-Identifier must be the string https://ocr-d.de/en/spec/bagit-profile.json.
Ocrd-Mets can be provided to declare that the METS file will not be the standard mets.xml but another path relative to /data/.
Implementations MUST check for the Ocrd-Mets tag: If it has a value, look for the METS file at that location, relative to /data. Otherwise, assume the default mets.xml.
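This lookup rule can be sketched in a few lines. The function name `resolve_mets_path` is hypothetical, and `bag_info` is assumed to be a plain dict holding the tags already parsed from bag-info.txt.

```python
from pathlib import Path


def resolve_mets_path(bag_root, bag_info: dict) -> Path:
    """Return the location of the METS file inside an unpacked bag.

    Uses the Ocrd-Mets tag if it has a non-empty value, else mets.xml.
    """
    relative = bag_info.get("Ocrd-Mets") or "mets.xml"
    return Path(bag_root) / "data" / relative
```

Treating an empty tag value like a missing tag is a design choice of this sketch, matching the "if it has a value" wording above.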
Ocrd-Identifier is a globally unique identifier for the work, works or parts of works that this bundle of files represents.
Repositories use it to identify new ingestions of existing works.
To ensure global uniqueness, the identifier should be prefixed with an identifier of the organization, e.g. an ISIL or domain name.
Ocrd-Base-Version-Checksum is the SHA512 checksum of the manifest-sha512.txt file of the version this bag was based on, if any.
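As a sketch, this checksum is a plain SHA512 digest over the bytes of the earlier manifest. The function name `manifest_checksum` is hypothetical. For empty input the digest equals the default value listed in the profile in Appendix A.

```python
import hashlib


def manifest_checksum(manifest_bytes: bytes) -> str:
    """Return the hex SHA512 digest of the raw bytes of manifest-sha512.txt."""
    return hashlib.sha512(manifest_bytes).hexdigest()
```

A packer would read manifest-sha512.txt of the base version in binary mode and store the result in bag-info.txt of the new bag.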
An OCRD-ZIP MUST be serialized as a ZIP file.
Checksums for the files in /data must be calculated with the SHA512 algorithm only and provided as manifest-sha512.txt.
Since the checksum of this manifest file can be relevant (see Ocrd-Base-Version-Checksum), in addition to the requirements of the BagIt spec, the entries MUST be sorted.
NOTE: These checksums can be generated with the following command:
find data -type f | sort -sf | xargs sha512sum > manifest-sha512.txt
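The same manifest can be produced without shell tools. This is a sketch with hypothetical function names. It assumes entries are sorted by plain code-point order of their path, which can differ from the locale-dependent, case-insensitive order of `sort -sf`; an implementation that needs byte-identical manifests has to agree on one ordering.

```python
import hashlib
from pathlib import Path


def sha512_of(path) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bag_root) -> str:
    """Return the text of manifest-sha512.txt for all files below data/."""
    root = Path(bag_root)
    # Paths are written relative to the bag root, with forward slashes.
    rel_paths = sorted(
        p.relative_to(root).as_posix()
        for p in (root / "data").rglob("*")
        if p.is_file()
    )
    # Same line layout as sha512sum: digest, two spaces, path.
    return "".join(f"{sha512_of(root / rel)}  {rel}\n" for rel in rel_paths)
```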
Within an OCRD-ZIP, all local file resources referenced in the METS (and consequently all those referenced in other files within the workspace -- see rule "If in PAGE then in METS") must be relative to the location of the METS file.
/tmp/foo/ws1/data
├── mets.xml
├── foo.tif
└── foo.xml
Valid mets:FLocat/@xlink:href in /tmp/foo/ws1/data/mets.xml:
- foo.xml
- foo.tif
- file://foo.tif
Invalid mets:FLocat/@xlink:href in /tmp/foo/ws1/data/mets.xml:
- /tmp/foo/ws1/data/foo.xml (absolute path)
- file:///tmp/foo/ws1/data/foo.tif (file URL scheme with absolute path)
- file:///foo.tif (relative path written as absolute path)
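The valid and invalid examples above can be told apart with a short check. This is a sketch: the function name `is_relative_local_href` is hypothetical, and it returns False for http(s) URLs simply because they are not local file references, not because they are forbidden.

```python
def is_relative_local_href(href: str) -> bool:
    """Return True if href is a local path relative to the METS file."""
    if href.startswith(("http://", "https://")):
        return False  # remote resource, not a local file reference
    # A leading "file://" is tolerated and stripped before the check.
    if href.startswith("file://"):
        href = href[len("file://"):]
    # What remains must be non-empty and must not be an absolute path.
    return href != "" and not href.startswith("/")
```

Applied to the examples, `file://foo.tif` passes because `foo.tif` remains after stripping the scheme, while `file:///foo.tif` fails because `/foo.tif` remains.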
All files except mets.xml itself that are contained in the data directory must be referenced in a mets:file/mets:FLocat in the mets.xml.
All local files (mets:file/mets:FLocat/@xlink:href that represent file paths) must be part of the OCRD-ZIP.
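Both directions of this rule can be checked by comparing the set of local references in the METS with the set of files on disk. The following sketch uses only the Python standard library. The function name `check_bag_consistency` is hypothetical, and the sketch assumes that references are already relative to the METS file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def check_bag_consistency(data_dir, mets_name="mets.xml"):
    """Return (unreferenced_files, missing_files) for an unpacked data directory."""
    data = Path(data_dir)
    tree = ET.parse(str(data / mets_name))
    referenced = set()
    for flocat in tree.iterfind(".//mets:file/mets:FLocat", NS):
        href = flocat.get(XLINK_HREF, "")
        if href.startswith(("http://", "https://")):
            continue  # remote URLs are not local files
        if href.startswith("file://"):
            href = href[len("file://"):]
        referenced.add(href)
    on_disk = {p.relative_to(data).as_posix() for p in data.rglob("*") if p.is_file()}
    on_disk.discard(mets_name)  # the METS file need not reference itself
    return sorted(on_disk - referenced), sorted(referenced - on_disk)
```

A bag satisfies both rules when both returned lists are empty.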
In addition to the actual data files in /data, the following metadata files are allowed to be present in the root of the bag:
- README.md: An extended, human-readable description of the dataset in the Markdown syntax
- Makefile: A GNU make build file to reproduce the data in /data.
- build.sh: A bash script to reproduce the data in /data.
- sources.csv: A comma-separated values list to be used in the scripts.
These files are purely for documentation and should not be used by processors in any way.
To pack a workspace to OCRD-ZIP:
- Create a temporary folder TMP
- Foreach mets:file f in the source METS:
  - Strip file:// from the beginning of the xlink:href of f
  - If it is not a file path (begins with http:// or https://):
    - continue
  - Download/Copy the file to a location within TMP/data. The structure SHOULD be <USE>/<ID> where
    - <USE> is the USE attribute of the parent mets:fileGrp
    - <ID> is the ID attribute of the mets:file
  - Replace the URL of f with the path relative to /data (SHOULD be <USE>/<ID>) in
    - all mets:FLocat of the METS
    - all other files in the workspace, esp. PAGE-XML
- Write out the changed METS to TMP/data/mets.xml
- Package TMP as a BagIt bag
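The packing steps can be sketched with the Python standard library. This is an illustration under stated assumptions, not a reference implementation: the name `pack_workspace` and its parameters are hypothetical, references inside other files such as PAGE-XML are not rewritten, remote URLs are left untouched, the bag files are placed at the root of the ZIP, and a bag without a base version is assumed to carry the empty-input SHA512 as its Ocrd-Base-Version-Checksum.

```python
import hashlib
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
XLINK_HREF = f"{{{XLINK_NS}}}href"
PROFILE_URL = "https://ocr-d.de/en/spec/bagit-profile.json"
EMPTY_SHA512 = hashlib.sha512(b"").hexdigest()  # assumed value for "no base version"


def pack_workspace(mets_path, zip_base, identifier, base_checksum=EMPTY_SHA512):
    """Pack the workspace around mets_path; zip_base is the target without ".zip"."""
    src_mets = Path(mets_path)
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)
    tree = ET.parse(str(src_mets))
    with tempfile.TemporaryDirectory() as tmp_name:
        tmp = Path(tmp_name)
        data = tmp / "data"
        data.mkdir()
        for grp in tree.iter(f"{{{METS_NS}}}fileGrp"):
            for f in grp.findall(f"{{{METS_NS}}}file"):
                flocat = f.find(f"{{{METS_NS}}}FLocat")
                if flocat is None:
                    continue
                href = flocat.get(XLINK_HREF, "")
                if href.startswith("file://"):
                    href = href[len("file://"):]
                if href.startswith(("http://", "https://")):
                    continue  # not a file path
                rel = f"{grp.get('USE')}/{f.get('ID')}"  # <USE>/<ID>
                target = data / rel
                target.parent.mkdir(parents=True, exist_ok=True)
                # Relative hrefs are resolved against the source METS location.
                shutil.copyfile(src_mets.parent / href, target)
                flocat.set(XLINK_HREF, rel)
        tree.write(str(data / "mets.xml"), encoding="utf-8", xml_declaration=True)
        (tmp / "bagit.txt").write_text(
            "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
            encoding="utf-8")
        (tmp / "bag-info.txt").write_text(
            f"BagIt-Profile-Identifier: {PROFILE_URL}\n"
            f"Ocrd-Identifier: {identifier}\n"
            f"Ocrd-Base-Version-Checksum: {base_checksum}\n",
            encoding="utf-8")
        # Sorted SHA512 manifest over everything below data/.
        rel_paths = sorted(
            p.relative_to(tmp).as_posix() for p in data.rglob("*") if p.is_file())
        manifest = "".join(
            f"{hashlib.sha512((tmp / r).read_bytes()).hexdigest()}  {r}\n"
            for r in rel_paths)
        (tmp / "manifest-sha512.txt").write_text(manifest, encoding="utf-8")
        # Returns the path of the created archive, zip_base + ".zip".
        return shutil.make_archive(str(zip_base), "zip", root_dir=str(tmp))
```

Because files are stored as <USE>/<ID>, the original file names and extensions are not preserved in this sketch.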
To unpack an OCRD-ZIP to a workspace:
- Unzip OCRD-ZIP z to a folder TMP
- If the value M of Ocrd-Mets is different from mets.xml:
  - Rename TMP/data/mets.xml to TMP/data/ + M
- Move TMP/data to an appropriate location to use as a workspace
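A sketch of the unpacking steps with the Python standard library. The name `unpack_ocrd_zip` is hypothetical. The sketch assumes that bag-info.txt sits at the root of the ZIP, that `workspace_dir` does not exist yet, and that the rename step only applies when a file TMP/data/mets.xml is actually present.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unpack_ocrd_zip(zip_path, workspace_dir) -> Path:
    """Unpack an OCRD-ZIP and return the path of the METS file in the workspace."""
    workspace = Path(workspace_dir)
    with tempfile.TemporaryDirectory() as tmp_name:
        tmp = Path(tmp_name)
        with zipfile.ZipFile(zip_path) as z:
            z.extractall(tmp)
        # Read the value M of Ocrd-Mets, falling back to the default.
        mets_name = "mets.xml"
        bag_info = (tmp / "bag-info.txt").read_text(encoding="utf-8")
        for line in bag_info.splitlines():
            label, sep, value = line.partition(":")
            if sep and label.strip() == "Ocrd-Mets" and value.strip():
                mets_name = value.strip()
        # Rename step, applied only if M differs and mets.xml exists.
        default_mets = tmp / "data" / "mets.xml"
        target = tmp / "data" / mets_name
        if mets_name != "mets.xml" and default_mets.exists() and not target.exists():
            default_mets.rename(target)
        # Move TMP/data to the final workspace location.
        shutil.move(str(tmp / "data"), str(workspace))
    return workspace / mets_name
```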
Appendix A: OCR-D BagIt profile

BagIt-Profile-Info:
  BagIt-Profile-Identifier: https://ocr-d.de/en/spec/bagit-profile.json
  BagIt-Profile-Version: '1.2.0'
  Source-Organization: OCR-D
  External-Description: BagIt profile for OCR data
  Contact-Name: Konstantin Baierer
  Contact-Email: [email protected]
  Version: 0.1
Bag-Info:
  Bagging-Date:
    required: false
  Source-Organization:
    required: false
  Ocrd-Mets:
    required: false
    default: 'mets.xml'
  Ocrd-Identifier:
    required: true
  Ocrd-Checksum:
    required: false
    # echo -n | sha512sum
    default: 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e'
Manifests-Required: ['sha512']
Tag-Manifests-Required: []
Tag-Files-Required: []
Tag-Files-Allowed:
  - README.md
  - Makefile
  - build.sh
  - sources.csv
  - metadata/*.xml
  - metadata/*.txt
Allow-Fetch.txt: false
Serialization: required
Accept-Serialization: application/zip
Accept-BagIt-Version:
  - '1.0'
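As an illustration of how the Tag-Files-Allowed list of the profile might be enforced, the following sketch matches the paths found in a bag against the listed patterns. The function name `disallowed_tag_files` is hypothetical. The sketch assumes that the files defined by BagIt itself are always permitted, and it uses shell-style matching in which `*` also matches `/`, so `metadata/*.xml` accepts nested paths as well.

```python
from fnmatch import fnmatchcase

# Patterns copied from Tag-Files-Allowed in the profile.
TAG_FILES_ALLOWED = [
    "README.md",
    "Makefile",
    "build.sh",
    "sources.csv",
    "metadata/*.xml",
    "metadata/*.txt",
]
# Assumption: files defined by BagIt itself are always permitted.
BAGIT_CORE_FILES = {
    "bagit.txt",
    "bag-info.txt",
    "manifest-sha512.txt",
    "tagmanifest-sha512.txt",
}


def disallowed_tag_files(paths):
    """Return the paths outside data/ that the profile does not allow."""
    bad = []
    for path in paths:
        if path.startswith("data/") or path in BAGIT_CORE_FILES:
            continue  # payload and core BagIt files are not tag files to check
        if not any(fnmatchcase(path, pattern) for pattern in TAG_FILES_ALLOWED):
            bad.append(path)
    return bad
```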
Proposed media type of OCRD-ZIP: application/vnd.ocrd+zip
Proposed extension: .ocrd.zip