Proof of concept code to extract chemical structures by brute force approaches using OPSIN

dan2097/opsin-document-extractor

OPSIN Document Extractor

This library uses OPSIN to find chemical names in free text. It works by determining which words OPSIN can parse and assuming that contiguous, space-separated chemical words belong to the same chemical name. Because OPSIN's parser is fast, this is in practice also a fast way of performing chemical entity recognition for systematic chemical names.
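The grouping step described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class `ContiguousNameSketch`, the `Span` type, and the method `groupChemicalWords` are hypothetical names, and a toy word list stands in for OPSIN's parser. Punctuation handling and character/case normalization are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative only: every name in this class is hypothetical, not part of the library.
public class ContiguousNameSketch {

    /** A run of adjacent chemical words; offsets index into the input, end is exclusive. */
    public static final class Span {
        public final int start;
        public final int end;
        public final String text;

        Span(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Splits the text on spaces and merges neighbouring words accepted by
     * isChemicalWord into a single span.
     */
    public static List<Span> groupChemicalWords(String text, Predicate<String> isChemicalWord) {
        List<Span> spans = new ArrayList<>();
        int runStart = -1; // -1 means no run of chemical words is currently open
        int runEnd = -1;
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == ' ') {
                i++;
                continue;
            }
            int wordStart = i;
            while (i < text.length() && text.charAt(i) != ' ') {
                i++;
            }
            String word = text.substring(wordStart, i);
            if (isChemicalWord.test(word)) {
                if (runStart < 0) {
                    runStart = wordStart; // open a new run
                }
                runEnd = i; // extend the run to the end of this word
            } else if (runStart >= 0) {
                // A non-chemical word closes the current run.
                spans.add(new Span(runStart, runEnd, text.substring(runStart, runEnd)));
                runStart = -1;
            }
        }
        if (runStart >= 0) {
            // Close a run that reaches the end of the text.
            spans.add(new Span(runStart, runEnd, text.substring(runStart, runEnd)));
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumption: a toy lexicon replaces the real "is this word parsable by OPSIN?" check.
        Set<String> toyLexicon = Set.of("ethyl", "acetate", "benzene");
        Predicate<String> toyCheck = w -> toyLexicon.contains(w.toLowerCase());

        String input = "Wash with ethyl acetate then add benzene slowly";
        for (Span s : groupChemicalWords(input, toyCheck)) {
            System.out.println(s.start + "\t" + s.end + "\t" + s.text);
        }
    }
}
```

Passing the word check in as a `Predicate` keeps the grouping logic independent of how a word is judged to be chemical, so the toy lexicon could in principle be swapped for a real parser call.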

Example Usage

// Requires java.util.List plus DocumentToStructures and IdentifiedChemicalName from this library
String input = "Pyridine and benzene are chemicals. ethylpyridines are a family of chemicals";
List<IdentifiedChemicalName> chemicalNames = new DocumentToStructures(input).extractNames();
for (IdentifiedChemicalName chemicalName : chemicalNames) {
	// Character offsets of the start/end of the chemical name
	System.out.println(chemicalName.getStart() + "\t" + chemicalName.getEnd());
	// The string of text between the start and the end of the chemical name
	System.out.println(chemicalName.getTextValue());
	// The chemical name after character/case normalization
	System.out.println(chemicalName.getChemicalName());
	// The type of chemical name (complete/part/family/polymer)
	System.out.println(chemicalName.getNameType());
}
