Polishing things up
JosselinSomervilleRoberts committed Jun 10, 2024
1 parent 8360010 commit b567afa
Showing 2 changed files with 76 additions and 4 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@ old_data/*
src/image2structure/compilation/webpage/test_data/valid_repo/_site/feed.xml
tmp/*
credentials/*
experimental/*

# Byte-compiled / optimized / DLL files
__pycache__/
79 changes: 75 additions & 4 deletions README.md
@@ -1,10 +1,19 @@
# Image2Structure - Data collection
# Image2Struct
[Paper](TODO) | [Website](https://crfm.stanford.edu/helm/image2structure/latest/) | Datasets ([Webpages](https://huggingface.co/datasets/stanford-crfm/i2s-webpage), [Latex](https://huggingface.co/datasets/stanford-crfm/i2s-latex), [Music sheets](https://huggingface.co/datasets/stanford-crfm/i2s-musicsheet)) | [Leaderboard](https://crfm.stanford.edu/helm/image2structure/latest/#/leaderboard) | [HELM repo](https://github.com/stanford-crfm/helm)

This repository contains the data collection for the Image2Structure project.
Welcome! The `image2struct` Python package contains the code used in the paper **Image2Struct: A Benchmark for Evaluating Vision-Language Models in Extracting Structured Information from Images**. This repo includes the following features:
* Data collection: scrapers, filters, compilers, and uploaders for the different data types (LaTeX, webpages, music sheets) from public sources (arXiv, GitHub, IMSLP, ...)
* Dataset upload: upload the datasets to the Hugging Face Datasets Hub
* Wild data collection: capture screenshots of webpages from a predetermined list of URLs, and format equation screenshots of your choice.

This repo **does not** contain:
* The evaluation code, which is available in the [HELM repo](https://github.com/stanford-crfm/helm).

## Installation
To install the package, you can use pip:
To install the package, you can use `pip` and `conda`:

    conda create -n image2struct python=3.9.18 -y
    conda activate image2struct
    pip install -e ".[all]"

Some formats require additional dependencies. To install all dependencies, use:
@@ -14,7 +23,69 @@ Some formats require additional dependencies. To install all dependencies, use:
Finally, create a `.env` file by copying the `.env.example` file and filling in the required values.
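The package reads these values at runtime. As a rough illustration only, a `.env` file of `KEY=VALUE` lines can be loaded with a few lines of standard-library Python; the key name in the usage example is hypothetical, not taken from `.env.example`:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader sketch: parse KEY=VALUE lines into os.environ.

    Skips blank lines and comments; existing environment variables win.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

For example, after `load_env()`, a hypothetical `GITHUB_TOKEN=...` entry would be available as `os.environ["GITHUB_TOKEN"]`.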


# Contributing
## Usage

### Data collection

You can run `image2structure-collect` to collect data from different sources. For example, to collect data from GitHub Pages:

    image2structure-collect --num_instances 300 --num_instances_at_once 50 --max_instances_per_date 40 --date_from 2024-01-01 --date_to 2024-02-20 --timeout 30 --destination_path data webpage --language css --port 4000 --max_size_kb 100

The general arguments are:
* `--num_instances`: the number of instances to collect
* `--num_instances_at_once`: the number of instances to collect at once. When the scraper is called, it will not ask the underlying API (here, the GitHub API) for more than `num_instances_at_once` instances, which helps avoid hitting the rate limit.
* `--max_instances_per_date`: the maximum number of instances to collect for a single date, so that no single date dominates the dataset.
* `--date_from`: the start of the date range to collect instances from.
* `--date_to`: the end of the date range to collect instances from.
* `--timeout`: the timeout in seconds for each instance collection.
* `--destination_path`: the path to save the collected data to.
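A minimal sketch of how the three count limits above might interact (this is an illustration, not the package's actual scheduler): each API call asks for at most `num_instances_at_once` instances, no date contributes more than `max_instances_per_date`, and collection stops once `num_instances` is reached.

```python
def plan_batches(num_instances, num_instances_at_once, max_instances_per_date, dates):
    """Illustrative batch planner: return a list of (date, batch_size) API calls
    that respects all three limits."""
    batches = []
    collected = 0
    for date in dates:
        taken_for_date = 0
        while collected < num_instances and taken_for_date < max_instances_per_date:
            size = min(num_instances_at_once,
                       max_instances_per_date - taken_for_date,
                       num_instances - collected)
            batches.append((date, size))
            taken_for_date += size
            collected += size
        if collected >= num_instances:
            break
    return batches
```

For instance, 100 instances over three dates with `num_instances_at_once=50` and `max_instances_per_date=40` yields batches of 40, 40, and 20.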

Then you can add specific arguments for the data type you want to collect. To do so, simply add the data type, here `webpage`, followed by the data-specific arguments. You can find the data-specific arguments in the `src/image2struct/run_specs.py` file.

The script will save the collected data to the specified destination path using the following layout:

    output_path
    ├── subcategory1
    │   ├── assets
    │   ├── images
    │   │   ├── uuid1.png
    │   │   ├── uuid2.png
    │   │   └── ...
    │   ├── metadata
    │   │   ├── uuid1.json
    │   │   ├── uuid2.json
    │   │   └── ...
    │   ├── structures # Depends on the data type
    │   │   ├── uuid1.{tex,tar.gz,...}
    │   │   ├── uuid2.{tex,tar.gz,...}
    │   │   └── ...
    │   └── (text) # Depends on the data type
    │       ├── uuid1.txt
    │       ├── uuid2.txt
    │       └── ...
    ├── subcategory2
    └── ...

### Upload datasets

Once you have collected some datasets, you can upload them to the Hugging Face Datasets Hub. For example, to upload the latex dataset:

    image2structure-upload --data-path data/latex --dataset-name stanford-crfm/i2s-latex --max-instances 50

This will upload the dataset to the Hugging Face Datasets Hub under the `stanford-crfm/i2s-latex` dataset name. The `--max-instances` argument specifies the maximum number of instances to upload, and `--data-path` specifies the path to the dataset files, which should follow the format output by the collection scripts.


### Wild data collection

There are two scripts to build the wild datasets: `src/image2struct/wild/wild_latex.py` and `src/image2struct/wild/wild_webpage.py`. You can simply run them to format the data (for the `wild_latex` script you will need to collect screenshots of equations manually, while `wild_webpage` takes screenshots of websites by itself):

    python src/image2struct/wild/wild_webpage.py
    python src/image2struct/wild/wild_latex.py

You can then upload the datasets to the Hugging Face Datasets Hub as explained above.

## Contributing
To contribute to this project, first install the development dependencies:

    pip install -e ".[dev]"
