This repository contains code to extract multiple choice questions from Dutch high school exams.
- Download selected PDFs: `download_pdfs.py`
- Extract plain text from PDFs: `pdf2text.py`
- Extract multiple choice questions and answers from plain text: `text2json.py`
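
To illustrate the last step, here is a minimal sketch of turning one block of plain text into a JSON record. It is not the code in `text2json.py`. It assumes that each answer option starts on its own line with a capital letter `A` to `F` followed by whitespace, and the function name `parse_question` and the output keys `question` and `options` are hypothetical.

```python
import json
import re

# Assumed layout: an option line looks like "A some answer text".
OPTION_RE = re.compile(r"^([A-F])\s+(.+)$")


def parse_question(block: str) -> dict:
    """Split one question block into its stem and its lettered options."""
    stem_lines = []
    options = {}
    current = None  # letter of the option currently being read
    for raw_line in block.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = OPTION_RE.match(line)
        if match:
            current = match.group(1)
            options[current] = match.group(2)
        elif current is not None:
            # Continuation of a wrapped option line.
            options[current] += " " + line
        else:
            # Everything before the first option belongs to the stem.
            stem_lines.append(line)
    return {"question": " ".join(stem_lines), "options": options}


if __name__ == "__main__":
    example = "Wat is 2 + 2?\nA 3\nB 4\nC 5\n"
    print(json.dumps(parse_question(example), ensure_ascii=False, indent=2))
```

A stem line that happens to begin with a single capital letter `A` to `F` would be misread as an option under this assumption, so real exam text likely needs extra checks.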
This code can be used to extract and process data from Examenblad.nl, as published on alleexamens.nl. The rights to these exams belong to the State of the Netherlands. Please refer to their copyright statement for more information.
Note: The question filtering should be improved before these questions are used directly, e.g. by adding more keywords that flag references to outside sources and by filtering on length to avoid concatenated questions.

Note 2: Most questions were filtered by hand, as no keywords were present.
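
The two improvements mentioned in the note could look roughly like the sketch below. It is an illustration, not the filter used in this repository: the keyword list, the 600-character limit, and the function names are all assumptions that would need tuning against the actual exam text.

```python
import re

# Hypothetical Dutch word stems that suggest the question depends on
# material outside its own text (a source, figure, table, or image).
OUTSIDE_SOURCE_RE = re.compile(r"\b(?:bron|figuur|tabel|afbeelding)", re.IGNORECASE)

# Assumed upper bound; very long "questions" are likely several
# questions concatenated during extraction.
MAX_QUESTION_CHARS = 600


def refers_to_outside_source(question: str) -> bool:
    """Return True if any word in the question starts with a flagged stem."""
    return OUTSIDE_SOURCE_RE.search(question) is not None


def keep_question(question: str, max_chars: int = MAX_QUESTION_CHARS) -> bool:
    """Keep a question only if it is short enough and self-contained."""
    if len(question) > max_chars:
        return False
    return not refers_to_outside_source(question)
```

Matching on word stems is deliberately loose, so it will also flag unrelated words that share a prefix. Given that most questions contained no such keywords, a filter like this can only reduce the manual work, not replace it.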
The dataset is available on Hugging Face.