A toolkit for downloading arXiv papers and converting them to MultiMarkdown (.mmd) format with Nougat, a neural OCR model. The pipeline extracts text and LaTeX equations and includes post-processing tools to clean up and merge the extracted data. See the arxiver dataset on the Hugging Face Hub for sample results.
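If you only want to explore the released data, you can load it with the datasets library. A minimal sketch, assuming the dataset ID is neuralwork/arxiver (check the Hub page for the exact name and available splits):

from datasets import load_dataset

# Dataset ID assumed to be "neuralwork/arxiver"; verify on the Hugging Face Hub.
dataset = load_dataset("neuralwork/arxiver", split="train")
print(dataset[0])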
arxiver/
    arxiv-tools/            # Tools for downloading arXiv papers
    utils/                  # Utility scripts to check processed data, get article metadata, etc.
    run_nougat.py           # Batch PDF processing script to extract text in .mmd format
    job_status_server.py    # Web server to monitor extraction progress
    postprocess.py          # Post-processing script to clean and merge Nougat outputs
The arxiv-tools folder contains scripts for downloading arXiv papers and computing useful statistics about the arXiv dataset. For detailed instructions, see the arxiv-tools README. Downloading and extracting the dataset creates a hierarchical folder structure organized by publication year and month:
output_dir/
    2310/           # October 2023
        paper1.pdf
        paper2.pdf
    2311/           # November 2023
        paper3.pdf
        paper4.pdf
The run_nougat.py script processes PDF files in batches using the Nougat neural OCR model:
python run_nougat.py \
--input_dir /path/to/datadir \
--output_dir /path/to/output \
--gpu_id 0 \
--batch_size 8
Point --input_dir to the directory created by the download step. The script processes PDFs in batches on the specified GPU and logs successful and failed jobs (Nougat is not 100% stable). The output keeps the same year-month subdirectory structure but saves each page as a separate file:
output_dir/
    2310/
        paper1_1.mmd    # Paper 1, page 1
        paper1_2.mmd    # Paper 1, page 2
        paper2_1.mmd
    2311/
        paper3_1.mmd
        paper3_2.mmd
        paper4_1.mmd
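For reference, a minimal sketch of the batching and logging pattern run_nougat.py follows, assuming the nougat CLI is installed and on PATH (flag names may differ between Nougat versions; the actual implementation is in the script itself):

import glob, json, os, subprocess

input_dir, output_dir = "/path/to/datadir", "/path/to/output"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # pin the job to a single GPU

status = {"done": [], "failed": []}
for pdf in sorted(glob.glob(os.path.join(input_dir, "*", "*.pdf"))):
    month_dir = os.path.basename(os.path.dirname(pdf))
    out_dir = os.path.join(output_dir, month_dir)
    os.makedirs(out_dir, exist_ok=True)
    # Invoke the nougat CLI on a single PDF; flags are assumptions, see `nougat --help`.
    result = subprocess.run(["nougat", pdf, "-o", out_dir, "--batchsize", "8"])
    status["done" if result.returncode == 0 else "failed"].append(pdf)

with open("job_status.json", "w") as f:
    json.dump(status, f, indent=2)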
We provide an optional script, job_status_server.py, that serves a web interface to monitor processing progress:
python job_status_server.py \
--input_dir /path/to/pdf/files \
--output_dir /path/to/output \
--port 8005
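Once the server is up, you can check progress from a browser or the command line (the exact page and any endpoints depend on the script):

curl http://localhost:8005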
The post-processing pipeline includes several steps to validate and clean up the Nougat output. You can optionally check how many of the papers have been fully processed (all pages successfully extracted) by running:
cd utils
python check_complete_results.py --pdf-dir /path/to/pdf/root/dir --mmd-dir /path/to/mmd/root/dir
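check_complete_results.py is the authoritative implementation; the underlying idea is to compare the page count of each PDF with the number of per-page .mmd files. A minimal sketch of that check, assuming pypdf is installed and the paperX_N.mmd naming shown above:

import glob, os
from pypdf import PdfReader

def is_complete(pdf_path, mmd_dir):
    """Return True if every page of the PDF has a matching per-page .mmd file."""
    n_pages = len(PdfReader(pdf_path).pages)
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    n_mmd = len(glob.glob(os.path.join(mmd_dir, f"{stem}_*.mmd")))
    return n_mmd == n_pages

print(is_complete("output_dir/2310/paper1.pdf", "mmd_dir/2310"))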
You can use the output .mmd files as they are, or run post-processing to remove headers and references and merge multi-page .mmd files into single documents. To do this, run the post-processing script:
cd ..
python postprocess.py --input-dir /path/to/processed-data --output-dir /path/to/output
Note that this script preserves the original hierarchical folder structure organized by publication year and month.
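postprocess.py implements the full cleanup; as an illustration, here is a minimal sketch of just the page-merging step, assuming the paperX_N.mmd naming above (header and reference removal are omitted):

import glob, re
from collections import defaultdict

def merge_pages(month_dir):
    """Group per-page .mmd files by paper id and concatenate them in page order."""
    pages = defaultdict(list)
    for path in glob.glob(f"{month_dir}/*.mmd"):
        stem, page = re.match(r"(.+)_(\d+)\.mmd$", path).groups()
        pages[stem].append((int(page), path))
    for stem, items in pages.items():
        text = "\n".join(open(p).read() for _, p in sorted(items))
        with open(f"{stem}.mmd", "w") as f:
            f.write(text)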
You can optionally get article metadata by running:
cd utils
python extract_metadata.py --input-dir /path/to/merged-mmd-folder
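extract_metadata.py handles this for the merged files; for reference, one way to look up metadata for a single arXiv ID is the public arXiv API. A sketch assuming the feedparser package (the actual script may work differently, and the ID below is only an example):

import urllib.request
import feedparser

arxiv_id = "2310.12345"  # hypothetical ID for illustration
url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
feed = feedparser.parse(urllib.request.urlopen(url).read())
entry = feed.entries[0]
print(entry.title, entry.published, [a.name for a in entry.authors])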
- A GPU with CUDA support is required for efficient processing
- Tested on an NVIDIA T4 GPU; processing speed depends on GPU memory and batch size
- arxiv-tools/ is adapted from the original repo
From neuralwork with ❤️