Extract ToC from Deutsche Nationalbibliothek #10119

Open
zorae opened this issue Dec 5, 2024 · 0 comments
Labels
Lead: @cdrini, Module: Table of Contents, Needs: Breakdown, Needs: Triage, Type: Proposal

zorae commented Dec 5, 2024

Proposal

The Deutsche Nationalbibliothek (DNB) provides high-quality scans of the table of contents for a large part of its holdings. Each scan is freely accessible as a PDF. For every OL edition that has a DNB identifier attached, OL could attempt to download the corresponding PDF and extract the ToC text. Note that the embedded text layer of some PDFs is wildly inaccurate, so it makes sense to run our own OCR on the page images rather than rely on that layer.

Example edition page at DNB:
https://d-nb.info/973546166

Example ToC:
https://d-nb.info/973546166/04

See also #8756
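
A minimal sketch of the download step described above, using only the Python standard library. The `/04` suffix is taken from the single example URL in this issue; whether every DNB record exposes its ToC at that path is an assumption that would need checking. The function names (`dnb_toc_url`, `looks_like_pdf`, `fetch_toc_pdf`) are hypothetical and not part of the Open Library codebase.

```python
from typing import Optional
import urllib.request

DNB_BASE = "https://d-nb.info"  # base of the example URLs in this issue
TOC_SUFFIX = "04"  # assumption: inferred from the one example ToC URL


def dnb_toc_url(dnb_id: str) -> str:
    """Build the candidate ToC PDF URL for a DNB identifier."""
    dnb_id = dnb_id.strip()
    if not dnb_id:
        raise ValueError("empty DNB identifier")
    return f"{DNB_BASE}/{dnb_id}/{TOC_SUFFIX}"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    return data.startswith(b"%PDF-")


def fetch_toc_pdf(dnb_id: str, timeout: float = 30.0) -> Optional[bytes]:
    """Download the ToC PDF; return None if the request fails or the
    response is not a PDF (for example, an HTML error page)."""
    try:
        with urllib.request.urlopen(dnb_toc_url(dnb_id), timeout=timeout) as resp:
            data = resp.read()
    except OSError:  # URLError, HTTPError and socket timeouts are OSErrors
        return None
    return data if looks_like_pdf(data) else None
```

For the example edition, `dnb_toc_url("973546166")` yields the example ToC URL given above. The OCR step is deliberately left out of the sketch: the returned bytes would be rendered to page images and passed to whichever OCR engine the project chooses, ignoring the PDF's own text layer.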

Justification

Problem: OL currently has a table of contents for only a small fraction of editions. This limits patrons’ ability to learn what a book is about.

Impact: More editions would have a ToC, especially German-language books.

Research: I’ve been manually OCRing and/or transcribing a number of ToCs from DNB for use on OL and can attest that the scans are of consistently high quality.

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and again each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

zorae added the Needs: Breakdown, Needs: Lead, Needs: Triage, and Type: Feature Request labels on Dec 5, 2024
mekarpeles added the Type: Proposal, Lead: @cdrini, and Module: Table of Contents labels and removed the Type: Feature Request and Needs: Lead labels on Dec 16, 2024