
Library of Congress Public Domain Books #74

storytracer opened this issue May 15, 2024 · 6 comments

@storytracer
The Library of Congress Selected Digitized Books collection contains 135,500+ English public domain books with 47.6 billion tokens.

@storytracer (Author)

I have uploaded the 2024-05-13 snapshot to HF: https://huggingface.co/datasets/storytracer/loc_books_dolma.

@craffel (Collaborator) commented May 15, 2024

> I have uploaded the 2024-05-13 snapshot to HF: https://huggingface.co/datasets/storytracer/loc_books_dolma.

They look pretty noisy. Is there some really basic heuristic filtering we can do? Say, filter out lines where the majority of characters are not alphanumeric?
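
For reference, a minimal sketch of that filter; the 0.5 threshold and the choice to ignore whitespace (and to keep blank lines) are assumptions, not a settled design:

```python
def mostly_alnum(line: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the non-space characters are alphanumeric."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return True  # keep blank lines so paragraph breaks survive
    return sum(c.isalnum() for c in chars) / len(chars) >= threshold


def filter_noisy_lines(text: str, threshold: float = 0.5) -> str:
    """Drop lines where the majority of characters are not alphanumeric."""
    return "\n".join(line for line in text.splitlines() if mostly_alnum(line, threshold))
```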

@storytracer (Author) commented May 15, 2024

In my experience, the OCR noise in digitized books is concentrated in the front matter, because books often start with several blank pages containing library stamps or speckles which the OCR engine misinterprets as characters. The noise level after the first 1–10 pages or so of each book should be fine. Since every book has a different number of front-matter pages, though, I haven't been able to think of a good heuristic yet.

@craffel (Collaborator) commented May 15, 2024

Remove everything before the first N pure alphanumeric lines?
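
A rough sketch of that idea, relaxing "pure alphanumeric" to a per-line threshold since real text contains punctuation; both N and the threshold are guesses that would need tuning against real books:

```python
def strip_front_matter(lines: list[str], n: int = 5, threshold: float = 0.9) -> list[str]:
    """Drop everything before the first run of n consecutive 'clean' lines.

    A line counts as clean when at least `threshold` of its non-space
    characters are alphanumeric.
    """
    run = 0
    for i, line in enumerate(lines):
        chars = [c for c in line if not c.isspace()]
        if chars and sum(c.isalnum() for c in chars) / len(chars) >= threshold:
            run += 1
            if run == n:
                return lines[i - n + 1:]  # keep text from the start of the clean run
        else:
            run = 0
    return lines  # no clean run found; leave the text untouched
```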

@storytracer (Author) commented May 15, 2024

That could go wrong for books with unusual front matter; the range is really quite diverse. I can work with PleiAs to develop a heuristic or even a model, since they deal with a lot of OCR text as well and have developed a library for OCR metrics and a promising post-OCR correction model. But they also question whether a little bit of noise in the front matter actually makes any difference in training, so I would like to leave the OCR text untouched for now until we have more insight into that. It would be great to create a general post-OCR dolma tagger based on their research, which we could easily apply to many different datasets.
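
Such a tagger might look roughly like the sketch below, assuming the tagger API from the allenai/dolma toolkit (import paths are from memory and may need adjusting); the class name, the "ocr_noise_v1" tagger name, and the simple alphanumeric-ratio score are all placeholders, not PleiAs's actual metrics:

```python
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.registry import TaggerRegistry
from dolma.core.taggers import BaseTagger


@TaggerRegistry.add("ocr_noise_v1")
class OcrNoiseTagger(BaseTagger):
    def predict(self, doc: Document) -> DocResult:
        # Score the whole document by its alphanumeric ratio; a real
        # tagger would emit per-page or per-line spans with better metrics.
        chars = [c for c in doc.text if not c.isspace()]
        score = sum(c.isalnum() for c in chars) / max(len(chars), 1)
        span = Span(start=0, end=len(doc.text), type="ocr_noise", score=score)
        return DocResult(doc=doc, spans=[span])
```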

@craffel (Collaborator) commented May 15, 2024

Noise is always bad!
