Library of Congress Public Domain Books #74
Comments
I have uploaded the 2024-05-13 snapshot to HF: https://huggingface.co/datasets/storytracer/loc_books_dolma.
They look pretty noisy. Is there some really basic heuristic filtering we can do? Say, filter out lines where the majority of characters are not alphanumeric?
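For concreteness, the line-level filter suggested here could look roughly like the sketch below. The function names, the 0.5 threshold, and the choice to ignore whitespace when counting characters are all assumptions made for illustration; none of this is code from the dataset pipeline.

```python
def mostly_alphanumeric(line: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the non-space characters are alphanumeric."""
    # Assumption: whitespace is ignored so ordinary word spacing is not counted as noise.
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return True  # keep blank lines so paragraph breaks survive
    alnum = sum(c.isalnum() for c in chars)
    return alnum / len(chars) >= threshold


def filter_noisy_lines(text: str, threshold: float = 0.5) -> str:
    """Drop lines in which the majority of characters are not alphanumeric."""
    kept = [line for line in text.split("\n") if mostly_alphanumeric(line, threshold)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Hello world\n~~ .. ,, ::\nChapter 1"
    print(filter_noisy_lines(sample))
```

A filter like this works line by line, so it would also drop legitimate lines that are mostly punctuation (for example, rows of dots in a table of contents).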
In my experience, the OCR noise in digitized books is concentrated in the front matter, because a book often starts with several blank pages containing library stamps or speckles that the OCR engine misinterprets as characters. The noise level after roughly the first 1-10 pages of each book should be fine. Since every book has a different number of pages of front matter, though, I couldn't think of a good heuristic yet.
Remove everything before the first N pure alphanumeric lines?
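One possible reading of this suggestion is sketched below: find the first run of N consecutive lines that contain only letters, digits, and whitespace, and discard everything before that run. The names `is_pure_alphanumeric` and `trim_front_matter`, the default `n=3`, and the strict definition of "pure alphanumeric" (no punctuation at all) are assumptions for illustration.

```python
def is_pure_alphanumeric(line: str) -> bool:
    """True for a non-empty line made up only of letters, digits, and whitespace."""
    stripped = line.strip()
    return bool(stripped) and all(c.isalnum() or c.isspace() for c in stripped)


def trim_front_matter(text: str, n: int = 3) -> str:
    """Drop everything before the first run of `n` consecutive pure alphanumeric lines."""
    lines = text.split("\n")
    run = 0  # length of the current run of clean lines
    for i, line in enumerate(lines):
        run = run + 1 if is_pure_alphanumeric(line) else 0
        if run == n:
            # The run started n - 1 lines before the current index.
            return "\n".join(lines[i - n + 1:])
    return text  # no qualifying run found: leave the text untouched


if __name__ == "__main__":
    sample = "|| ..\n;: ~\nIt was a dark night\nThe wind blew hard\nNobody came"
    print(trim_front_matter(sample, n=2))
```

Because ordinary prose is full of punctuation, the strict test above would rarely fire on real books; a practical version would probably need to allow common punctuation or fall back to a ratio threshold.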
That could go wrong with unusual front matter; the range is really quite diverse. I can work with PleiAs to develop a heuristic or even a model, since they deal with a lot of OCR text as well and have developed a library for OCR metrics and a promising post-OCR correction model. But they also question whether a little bit of noise in the front matter actually makes any difference in training, so I would like to leave the OCR text untouched for now until we have more insight into that. It would be great to create a general post-OCR dolma tagger based on their research, which we could easily apply to many different datasets.
Noise is always bad!
The Library of Congress Selected Digitized Books collection contains 135,500+ English public domain books with 47.6 billion tokens.