
Library of Congress Public Domain Books #74

storytracer opened this issue May 15, 2024 · 6 comments

@storytracer
The Library of Congress Selected Digitized Books collection contains 135,500+ English public domain books with 47.6 billion tokens.

@storytracer (Author)

I have uploaded the 2024-05-13 snapshot to HF: https://huggingface.co/datasets/storytracer/loc_books_dolma.

@craffel (Collaborator) commented May 15, 2024

> I have uploaded the 2024-05-13 snapshot to HF: https://huggingface.co/datasets/storytracer/loc_books_dolma.

They look pretty noisy. Is there some really basic heuristic filtering we can do? Say, filter out lines where the majority of characters are not alphanumeric?
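
For reference, a minimal sketch of that filter; the 0.5 threshold and the choice to ignore whitespace (and to keep blank lines) are assumptions, not a settled design:

```python
def mostly_alnum(line: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the non-space characters are alphanumeric."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return True  # keep blank lines so paragraph breaks survive
    return sum(c.isalnum() for c in chars) / len(chars) >= threshold


def filter_noisy_lines(text: str, threshold: float = 0.5) -> str:
    """Drop lines where the majority of characters are not alphanumeric."""
    return "\n".join(line for line in text.splitlines() if mostly_alnum(line, threshold))
```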

@storytracer (Author) commented May 15, 2024

In my experience, the OCR noise in digitized books is concentrated in the front matter, because books often start with several blank pages containing library stamps or speckles which the OCR engine misinterprets as characters. The noise level after the first 1–10 pages or so of each book should be fine. Since every book has a different number of front-matter pages, though, I haven't been able to think of a good heuristic yet.

@craffel (Collaborator) commented May 15, 2024

Remove everything before the first N pure alphanumeric lines?
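
A rough sketch of that idea, relaxing "pure alphanumeric" to a per-line threshold since real text contains punctuation; both N and the threshold are guesses that would need tuning against real books:

```python
def strip_front_matter(lines: list[str], n: int = 5, threshold: float = 0.9) -> list[str]:
    """Drop everything before the first run of n consecutive 'clean' lines.

    A line counts as clean when at least `threshold` of its non-space
    characters are alphanumeric.
    """
    run = 0
    for i, line in enumerate(lines):
        chars = [c for c in line if not c.isspace()]
        if chars and sum(c.isalnum() for c in chars) / len(chars) >= threshold:
            run += 1
            if run == n:
                return lines[i - n + 1:]  # keep text from the start of the clean run
        else:
            run = 0
    return lines  # no clean run found; leave the text untouched
```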

@storytracer (Author) commented May 15, 2024

That could go wrong for books with unusual front matter; the range is really quite diverse. I can work with PleiAs to develop a heuristic or even a model, since they deal with a lot of OCR text as well and have developed a library for OCR metrics and a promising post-OCR correction model. But they also question whether a little bit of noise in the front matter actually makes any difference in training, so I would like to leave the OCR text untouched for now until we have more insight into that. It would be great to create a general post-OCR dolma tagger based on their research, which we could easily apply to many different datasets.
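
Such a tagger might look roughly like the sketch below, assuming the tagger API from the allenai/dolma toolkit (import paths are from memory and may need adjusting); the class name, the "ocr_noise_v1" tagger name, and the simple alphanumeric-ratio score are all placeholders, not PleiAs's actual metrics:

```python
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.registry import TaggerRegistry
from dolma.core.taggers import BaseTagger


@TaggerRegistry.add("ocr_noise_v1")
class OcrNoiseTagger(BaseTagger):
    def predict(self, doc: Document) -> DocResult:
        # Score the whole document by its alphanumeric ratio; a real
        # tagger would emit per-page or per-line spans with better metrics.
        chars = [c for c in doc.text if not c.isspace()]
        score = sum(c.isalnum() for c in chars) / max(len(chars), 1)
        span = Span(start=0, end=len(doc.text), type="ocr_noise", score=score)
        return DocResult(doc=doc, spans=[span])
```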

@craffel (Collaborator) commented May 15, 2024

Noise is always bad!
