Extract ToC from Deutsche Nationalbibliothek #10119
Labels
Lead: @cdrini
Module: Table of Contents
Needs: Breakdown
Needs: Triage
Type: Proposal
Proposal
Deutsche Nationalbibliothek (DNB) has high-quality scans of the table of contents for a large part of its holdings, freely accessible as PDFs. For each OL edition that has a DNB identifier attached, OL could attempt to download the corresponding PDF and extract the ToC text. Note that some of the PDFs have a wildly inaccurate text layer, so it makes sense to run our own OCR rather than rely on the embedded text.
Example edition page at DNB:
https://d-nb.info/973546166
Example ToC:
https://d-nb.info/973546166/04
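A rough sketch of the lookup step described above, in Python. Everything here is an assumption for illustration: the `/04` suffix is taken from the single example URL and may not hold for every record, the identifier check is a guess at the DNB number format, the function names are hypothetical, and `ocr` is a placeholder for whatever OCR step OL ends up using.

```python
import re
import urllib.request
from typing import Callable, Optional

DNB_BASE = "https://d-nb.info"
TOC_SUFFIX = "04"  # assumption: inferred from the one example URL above


def dnb_toc_url(dnb_id: str) -> str:
    """Build the presumed ToC PDF URL for a DNB identifier."""
    dnb_id = dnb_id.strip()
    # Assumption: identifiers are digits with an optional trailing X check character.
    if not re.fullmatch(r"[0-9]+[Xx]?", dnb_id):
        raise ValueError(f"unexpected DNB identifier: {dnb_id!r}")
    return f"{DNB_BASE}/{dnb_id}/{TOC_SUFFIX}"


def looks_like_pdf(data: bytes) -> bool:
    """Check the PDF magic bytes, since a missing ToC may return something else."""
    return data.startswith(b"%PDF-")


def fetch_toc_pdf(dnb_id: str, timeout: float = 30.0) -> Optional[bytes]:
    """Download the ToC PDF; return None if the request fails or the body is not a PDF."""
    try:
        with urllib.request.urlopen(dnb_toc_url(dnb_id), timeout=timeout) as resp:
            data = resp.read()
    except OSError:  # covers URLError and HTTPError
        return None
    return data if looks_like_pdf(data) else None


def extract_toc_text(dnb_id: str, ocr: Callable[[bytes], str]) -> Optional[str]:
    """Fetch the PDF and pass it to a caller-supplied OCR function (hypothetical)."""
    pdf = fetch_toc_pdf(dnb_id)
    return ocr(pdf) if pdf is not None else None
```

Passing the OCR step in as a callable keeps the sketch independent of any particular OCR engine, which matters here because the embedded text layer cannot be trusted.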
See also #8756
Justification
Problem: OL currently has a table of contents for only a small fraction of editions. This limits patrons’ ability to learn what a book is about.
Impact: Increase the number of ToCs, especially for German-language books.
Research: I’ve been manually OCRing and/or transcribing a number of ToCs from DNB for use on OL and can attest that the scans are of consistently high quality.
Breakdown
Requirements Checklist
Related files
Stakeholders
Instructions for Contributors
Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and again each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.