AutoMeta tool for reference #1

Open
juhoinkinen opened this issue Nov 28, 2023 · 3 comments

Comments

@juhoinkinen
Member

juhoinkinen commented Nov 28, 2023

Found this AutoMeta tool for metadata extraction:

AutoMeta is a metadata extractor tool for scanned Electronic Theses and Dissertations (ETDs). It has been built to extract seven metadata fields from the cover page of scanned ETDs. These fields are: Title, Author, Advisor, University, Degree, Program, and Year. It utilize learning based model such as CRF model with text-based and visual-based features.

We could check the quality of its results compared to Meteor and LLM-based methods.
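One way to make such a comparison concrete is per-field exact-match accuracy against a ground truth. The sketch below is only an illustration: the function names, the field keys, and the normalization rule are assumptions, not part of AutoMeta or Meteor.

```python
from collections import defaultdict

# Hypothetical field keys, modelled on the seven fields AutoMeta targets.
FIELDS = ["title", "author", "advisor", "university", "degree", "program", "year"]


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(str(value or "").lower().split())


def field_accuracy(predictions, ground_truth, fields=FIELDS):
    """Return per-field exact-match accuracy over paired documents.

    Both arguments are lists of dicts in the same document order.
    A field missing on both sides normalizes to "" and counts as a match.
    """
    correct = defaultdict(int)
    for pred, gold in zip(predictions, ground_truth):
        for field in fields:
            if normalize(pred.get(field)) == normalize(gold.get(field)):
                correct[field] += 1
    total = len(ground_truth)
    return {field: (correct[field] / total if total else 0.0) for field in fields}
```

Running the same function over the output of each tool (AutoMeta, Meteor, an LLM-based extractor) on the same documents would give directly comparable numbers. Exact match is strict, so a fuzzy comparison may suit long fields such as titles better.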

@juhoinkinen
Member Author

The group behind AutoMeta has published some papers about theses and dissertations.

Here is a presentation about ETDSuite/ETDMiner (a library including AutoMeta), which aims to segment and parse theses and dissertations. The segmentation employs a multimodal model to classify pages into 13 categories.

@juhoinkinen
Member Author

The poster paper "A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations" is the most relevant one for metadata extraction specifically, although it dates from 2020:

The process started with converting scanned pages into images and then
text files by applying OCR tools. Then a series of carefully designed
regular expressions for each field is applied, capturing patterns
for seven metadata fields: titles, authors, years, degrees, academic
programs, institutions, and advisors. The method is evaluated on a
ground truth dataset comprised of rectified metadata provided by
the Virginia Tech and MIT libraries.
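To give a rough idea of what the regular-expression step described above could look like, here is a minimal sketch. The patterns, the `PATTERNS` table, and `extract_metadata` are made up for illustration and are not the ones from the paper. They cover only four of the seven fields and assume the OCR text of the cover page is already available as a string.

```python
import re

# Hypothetical patterns, one per field; real cover pages need far more variants.
PATTERNS = {
    # First four-digit number that looks like a year between 1800 and 2099.
    "year": re.compile(r"\b(1[89]\d{2}|20\d{2})\b"),
    # A line of the form "by <name>".
    "author": re.compile(r"^[ \t]*by[ \t]+(.+)$", re.IGNORECASE | re.MULTILINE),
    # A small, incomplete list of degree names.
    "degree": re.compile(
        r"\b(Doctor of Philosophy|Master of Science|Master of Arts)\b",
        re.IGNORECASE,
    ),
    # A line of the form "Advisor: <name>" (or Supervisor / Chair).
    "advisor": re.compile(
        r"^[ \t]*(?:Advisor|Supervisor|Chair)[ \t]*:[ \t]*(.+)$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def extract_metadata(ocr_text):
    """Apply each field's pattern and keep the first match, or None if nothing matches."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        result[field] = match.group(1).strip() if match else None
    return result
```

A baseline like this is cheap to run and easy to inspect, which makes it a useful reference point when judging learned or LLM-based extractors.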

@osma
Member

osma commented Dec 1, 2023

This more recent paper by (mostly) the same authors goes into more detail:
MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries
https://ieeexplore.ieee.org/document/10265916
preprint https://arxiv.org/abs/2303.17661
