OpenChemIE

Authors: Yujie Qian, Alex Wang, Vincent Fan, Amber Wang, Regina Barzilay (MIT CSAIL)

OpenChemIE is an open source toolkit for aiding with chemistry information extraction by offering methods for extracting molecule or reaction data from figures or text. Given PDFs from chemistry literature, we run specialized machine learning models to efficiently extract structured data. For text analysis, we provide methods for named entity recognition and reaction extraction. For figure analysis, we offer methods for molecule detection, text-figure coreference, molecule recognition, and reaction diagram parsing. For more information on the models involved, see below.

Citation

If you use OpenChemIE in your research, please cite our paper.

@misc{fan2024openchemie,
      title={OpenChemIE: An Information Extraction Toolkit For Chemistry Literature}, 
      author={Vincent Fan and Yujie Qian and Alex Wang and Amber Wang and Connor W. Coley and Regina Barzilay},
      year={2024},
      eprint={2404.01462},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Installation

First create and activate a conda virtual environment with the following

conda create -n openchemie python=3.9
conda activate openchemie

Run the following commands to install the package and its dependencies

pip install 'OpenChemIE @ git+https://github.com/CrystalEye42/OpenChemIE'

Alternatively, for development of the package, clone and install as editable with the following

git clone https://github.com/CrystalEye42/OpenChemIE.git
cd OpenChemIE
pip install --editable .

Additionally, if Poppler is not already installed on your system, follow the corresponding installation instructions for your OS.

Usage

Importing all models:

import torch
from openchemie import OpenChemIE

model = OpenChemIE(device=torch.device('cpu')) # change to cuda for gpu

List of Methods

Extracting Molecule Information From PDFs

These methods are for identifying and translating molecules in figures to their chemical structures, as well as for named entity recognition from texts.

import torch
from openchemie import OpenChemIE

model = OpenChemIE()
pdf_path = 'example/acs.joc.2c00749.pdf'  # Change it to the path of your PDF
# Figure analysis
figure_results = model.extract_molecules_from_figures_in_pdf(pdf_path)
# Text analysis
text_results = model.extract_molecules_from_text_in_pdf(pdf_path)

The output when extracting molecules from figures has the following format

[
    {   # first figure
        'image': ndarray of the figure image,
        'molecules': [
            {   # first molecule
                'bbox': tuple in the form (x1, y1, x2, y2),
                'score': float,
                'image': ndarray of cropped molecule image,
                'smiles': str,
                'molfile': str
            },
            # more molecules
        ],
        'page': int
    },
    # more figures
]

Output when extracting molecules from text has the following format

[
    {
        'molecules': [
            { # first paragraph
                'text': str,
                'labels': [
                    (str, int, int), # tuple of label, range start (inclusive), range end (exclusive)
                    # more labels
                ]
            },
            # more paragraphs
        ]
        'page': int
    },
    # more pages
]

Extracting Reaction Information From PDFs

These methods are for parsing reaction schemes and conditions from figures or from text.

import torch
from openchemie import OpenChemIE

model = OpenChemIE()
pdf_path = 'example/acs.joc.2c00749.pdf'
figure_results = model.extract_reactions_from_figures_in_pdf(pdf_path)
text_results = model.extract_reactions_from_text_in_pdf(pdf_path)

The output when extracting reactions from figures has the following format

[
    {
        'figure': PIL image
        'reactions': [
            {
                'reactants': [
                    {
                        'category': str,
                        'bbox': tuple (x1,x2,y1,y2),
                        'category_id': int,
                        'smiles': str,
                        'molfile': str,
                    },
                    # more reactants
                ],
                'conditions': [
                    {
                        'category': str,
                        'bbox': tuple (x1,x2,y1,y2),
                        'category_id': int,
                        'text': list of str,
                    },
                    # more conditions
                ],
                'products': [
                    # same structure as reactants
                ]
            },
            # more reactions
        ],
        'page': int
    },
    # more figures
]

visualization method (utility function) an example of the output

Output when extracting reactions from text has the following format

[
    {
        'page': int,
        'reactions': [
            {
                'tokens': list of words in relevant sentence,
                'reactions' : [
                    {
                        # key, value pairs where key is the label and value is a tuple of the form (tokens, start index, end index)
                        # where indices are for the corresponding token list and start and end are inclusive
                    },
                    # more reactions
                ]
            },
            # reactions in other sentences
        ]
    },
    # more pages
]

Multimodal Reaction Extraction from PDFs

These methods combine information across figures, text, and/or tables to extract a more comprehensive set of reactions.

import torch
from openchemie import OpenChemIE

model = OpenChemIE()
pdf_path = 'example/acs.joc.2c00749.pdf'
full_results = model.extract_reactions_from_pdf(pdf_path)

# following results are already contained in full results
resolved_text_results = model.extract_reactions_from_text_in_pdf_combined(pdf_path)
figure_table_results = model.extract_reactions_from_figures_and_tables_in_pdf(pdf_path)

The output when extracting the full results has the following format

{
    'figures': [
        {
            'figure': PIL image
            'reactions': [
                {
                    'reactants': [
                        {
                            'category': str,
                            'bbox': tuple (x1,x2,y1,y2),
                            'category_id': int,
                            'smiles': str,
                            'molfile': str,
                        },
                        # more reactants
                    ],
                    'conditions': [
                        {
                            'category': str,
                            'text': list of str,
                        },
                        # more conditions
                    ],
                    'products': [
                        # same structure as reactants
                    ]
                },
                # more reactions
            ],
            'page': int
        },
        # more figures
    ]
    'text': [
        {
            'page': page number
            'reactions': [
                {
                    'tokens': list of words in relevant sentence,
                    'reactions' : [
                        {
                            # key, value pairs where key is the label and value is a tuple
                            # or list of tuples of the form (tokens, start index, end index)
                            # where indices are for the corresponding token list and start and end are inclusive
                        }
                        # more reactions
                    ]
                }
                # more reactions in other sentences
            ]
        },
        # more pages
    ]
}

The output when extracting from text and resolving coreferences has the same format as the value associated with the 'text' key when extracting the full results. Similarly, the output when extracting from figures and tables has the same format as the value of the 'figures' key in the full results.

Extracting Molecule Corefs From PDFs

This method is for resolving coreferences between molecules in the figure and labels in the text.

extract_molecule_corefs_from_figures_in_pdf

import torch
from openchemie import OpenChemIE

model = OpenChemIE()
pdf_path = 'example/acs.joc.2c00749.pdf'
results = model.extract_molecule_corefs_from_figures_in_pdf(pdf_path)

The output has the following format

[
    {
        'bboxes': [
            {   # first bbox
                'category': '[Sup]', 
                'bbox': (0.0050025012506253125, 0.38273870663142223, 0.9934967483741871, 0.9450094869920168), 
                'category_id': 4, 
                'score': -0.07593922317028046
            },
            # More bounding boxes
        ],
        'corefs': [
            [0, 1],  # molecule bounding box index, identifier bounding box index
            [3, 4],
            # More coref pairs
        ],
        'page': int
    },
    # More figures
]

Extracting From a List of Figures

The previous methods were for extracting directly from a PDF file. Below, we provide corresponding methods for extracting from a list of figure images instead.

import torch
from openchemie import OpenChemIE
import cv2
from PIL import Image

model = OpenChemIE()

img = cv2.imread('example/img1.png')
img2 = cv2.imread('example/img2.png')
img3 = Image.open('example/img3.png')
images = [img, img2, img3] # supports both cv2 and PIL images

molecule_results = model.extract_molecules_from_figures(images)
reaction_results = model.extract_reactions_from_figures(images)
bbox_results = model.extract_molecule_bboxes_from_figures(images)
coref_results = model.extract_molecule_corefs_from_figures(images)

The output format for these methods are largely the same as their corresponding PDF methods, just missing the 'page' key. However, for extracting molecule bounding boxes from images, (which doesn't have a corresponding method,) the output has the following format

[
    [   # first figure
        {   # first bounding box
            'category': str,
            'bbox': tuple in the form (x1, y1, x2, y2),
            'category_id': int,
            'score': float
        },
        # more bounding boxes
    ],
    # more figures
]

Extracting Figures and Tables From PDFs

These are helper methods for extracting just tables or figures without performing further analysis on them.

import torch
from openchemie import OpenChemIE

model = OpenChemIE()
pdf_path = 'example/acs.joc.2c00749.pdf'
figures = model.extract_figures_from_pdf(pdf_path, output_bbox=False, output_image=True)
tables = model.extract_tables_from_pdf(pdf_path, output_bbox=False, output_image=True)

Output format when extracting figures

[
    {   # first figure
        'title': {
            'text': str,
            'bbox': list in form [x1, y1, x2, y2],
        },
        'figure': {
            'image': PIL image or None,
            'bbox': list in form [x1, y1, x2, y2],
        }
        'table': {
            'bbox': list in form [x1, y1, x2, y2] or empty list,
            'content': {
                'columns': list of column headers,
                'rows': list of list of row content,
            } or None
        }
        'footnote': str or empty,
        'page': int
    },
    # more figures
]

Output format when extracting tables

[
    { # first table
        'title': {
            'text': str,
            'bbox': list in form [x1, y1, x2, y2],
        },
        'figure': {
            'image': PIL image or None,
            'bbox': list in form [x1, y1, x2, y2] or empty list,
        }
        'table': {
            'bbox': list in form [x1, y1, x2, y2] or empty list,
            'content': {
                'columns': list of column headers,
                'rows': list of list of row content,
            }
        }
        'footnote': str or empty,
        'page': int
    },
    # more tables
]

Data

Data for evaluating the process of R-group resolution is found at this HuggingFace repository. The HuggingFace repository contains every diagram in the dataset here as well as groundtruth annotations here.

The annotations take the following format:

[
  {
    "file_name": "acs.joc.2c00176 example 1.png",
    "reaction_template": {
      "reactants": [
        "*C(=O)NN=CC(F)(F)F",
        "N#CN"
      ],
      "products": [
        "*C(=O)N1NC(C(F)(F)F)N=C1N"
      ]
    },
    "detailed_reactions": {
      "a": {
        "reactants": [
          "O=C(NN=CC(F)(F)F)c1ccccc1",
          "N#CN"
        ],
        "products": [
          "NC1=NC(C(F)(F)F)NN1C(=O)c1ccccc1"
        ]
      },
      # more reactions
    }
  },
  # more diagrams
]

Additionally, Jupyter notebooks used during the annotation process can be downloaded here.

Diagrams and data used in the comparison against Reaxys can also be found in the same HuggingFace repository.

Loading Custom Model Checkpoints

Init methods for models

To load a specific checkpoint for a model, pass in your path to checkpoint to its corresponding init method. For example, to change the checkpoint of MolScribe

import torch
from huggingface_hub import hf_hub_download
from openchemie import OpenChemIE

model = OpenChemIE()

ckpt_path = hf_hub_download("yujieq/MolScribe", "swin_base_char_aux_1m680k.pth")
model.init_molscribe(ckpt_path)

Models In OpenChemIE

MolScribe
- An image-to-graph model for molecular structure recognition
- Paper: https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480
- Code: https://github.com/thomas0809/MolScribe
- Demo: https://huggingface.co/spaces/yujieq/MolScribe
RxnScribe
- An image-to-sequence generation model for reaction diagram parsing
- Paper: https://pubs.acs.org/doi/10.1021/acs.jcim.3c00439
- Code: https://github.com/thomas0809/rxnscribe
- Demo: https://huggingface.co/spaces/yujieq/RxnScribe
MolDetect and MolCoref
- An image-to-sequence generation model for identifying molecule bounding boxes, or for resolving coreferences between labels and molecules
- Code: https://github.com/Ozymandias314/MolDetect/tree/main
ChemNER
- A sequence labeling model for identifying chemical entities in text
- Code: https://github.com/Ozymandias314/ChemIENER
ChemRxnExtractor
- A sequence labeling model for parsing reactions from text
- Paper: https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.1c00284
- Code: https://github.com/jiangfeng1124/ChemRxnExtractor

Name		Name	Last commit message	Last commit date
Latest commit History 151 Commits
example		example
openchemie		openchemie
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OpenChemIE

Citation

Installation

Usage

List of Methods

Extracting Molecule Information From PDFs

Extracting Reaction Information From PDFs

Multimodal Reaction Extraction from PDFs

Extracting Molecule Corefs From PDFs

Extracting From a List of Figures

Extracting Figures and Tables From PDFs

Data

Loading Custom Model Checkpoints

Models In OpenChemIE

About

Releases

Packages

Contributors 4

Languages

License

CrystalEye42/OpenChemIE

Folders and files

Latest commit

History

Repository files navigation

OpenChemIE

Citation

Installation

Usage

List of Methods

Extracting Molecule Information From PDFs

Extracting Reaction Information From PDFs

Multimodal Reaction Extraction from PDFs

Extracting Molecule Corefs From PDFs

Extracting From a List of Figures

Extracting Figures and Tables From PDFs

Data

Loading Custom Model Checkpoints

Models In OpenChemIE

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages