- This repository contains an application whose main focus is Named Entity Recognition (NER), a natural language processing (NLP) task.
- The aim is to extract common entities from texts or webpages.
- For this purpose, the SpaCy library is used to train a deep learning model based on neural networks to detect entities in text data.
- To be able to train the model, it is also shown how to create a training dataset and label it in order to perform NER.
- This project is designed for end users or developers who need to extract the characteristics of OSH projects.
- The selected characteristics are manufacturing process, machine type, material and dimensions.
- As a future step, this information can be linked to suitable manufacturers so that users can find a corresponding Makerspace/FabLab to manufacture their prototypes or individualized products.
- Technology status of the project is OTRL 4, which means it is in an early prototype stage.
- Documentation status of the project is ODRL 3, which means it is an early release.
- The next steps are:
- Completing the documentation
- New use cases
- Enhancing the algorithm for better accuracy
- The needed skills are basic Python skills and some enthusiasm.
There are already many algorithms for Named Entity Recognition, but not for this specific topic with manufacturing characteristics. The algorithm can still be trained with other training data or for other parameters to extract other needed information and enhance the use cases.
- The algorithm uses SpaCy's NER to train the model with deep learning (a neural network) for the characteristics mentioned.
- Through a web application, any kind of text can be given as input, and the output is returned with the classified entities.
- The application can be used for any plain-text input or for Mediawiki-based websites (though this was only tested on Appropedia).
- As a research partner in the OPEN!NEXT project, a small team from Fraunhofer IPK has created this solution to contribute to OSH.
- OPEN!NEXT received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 869984.
- Our team has some basic Python skills.
We will perform the following steps with the application:
- Create a training dataset from OSH project websites.
- Label the dataset manually with the selected entities, using the Doccano labeling tool.
- Save the labels in a text file as JSONL (see the example after this list).
- Use SpaCy's neural network model to train a new statistical model.
- Save the model.
- Create a SpaCy NLP pipeline and use the new model to detect manufacturing entities.
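For reference, a single line of the Doccano JSONL export looks roughly like this (the sentence, character offsets, and label names below are invented for illustration; the project's actual label names may differ):

```
{"text": "The case is 3D printed from PLA.", "labels": [[12, 22, "PROCESS"], [28, 31, "MATERIAL"]]}
```

Each label is a [start, end, label] triple of character offsets into the text.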
- Needed software:
- Python interpreter (Python 3.7 or higher)
- Doccano (1.3.0)
- Needed libraries (see the install command after this list):
- SpaCy (3.0.5)
- json
- Tkinter
- requests
- bs4 (Beautiful Soup)
- Pandas
- Pickle
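The third-party libraries can be installed with pip (json, Tkinter, and Pickle ship with the standard Python installation); a minimal command, assuming a plain pip setup, is:

```
pip install spacy==3.0.5 requests beautifulsoup4 pandas
```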
- You need to install an IDE on your computer and clone the repository into it; then, to run the application, run either runModel_support file one or two:
- Copy the URL of the repository
- Open a Python IDE that allows cloning a GitHub repository; in our case we used PyCharm Community Edition 2020.3.2.
- Create a new project and save it.
- In the project, go to VCS and click on "Create Git Repository" (you may need to install Git for the IDE).
- Instead of VCS, the menu will now show Git; under Git, select "Clone" and paste the URL.
- It will ask you to log in (if the repository is private).
- After cloning, the repository will open in your IDE. Create the Python environment if this has not happened automatically, and set the Python interpreter. To open application 1 or 2, go from the directory to the selected app and run the runModel_support file.
- The application will start in a new window; you can enter your input and see the results.
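Alternatively, if you prefer the command line to the IDE's Git integration, the repository can be cloned directly; the URL below is a placeholder for the actual repository URL:

```
git clone https://github.com/<user>/<repository>.git
```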
- To create a dataset, our team used basic definitions and some open source hardware platforms. The training dataset is saved as a .txt file to be imported into Doccano later on.
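As a rough illustration (not the project's actual script), training text could be collected from project pages with requests and Beautiful Soup, both in the library list above; the URL and file name below are hypothetical examples:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example page; replace with the OSH project pages you want to use
urls = ["https://www.appropedia.org/Hexayurt"]

with open("train_dataset.txt", "w", encoding="utf-8") as out_file:
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        # Keep only paragraph text, one line per paragraph, for the Doccano import
        for p in soup.find_all("p"):
            text = p.get_text(strip=True)
            if text:
                out_file.write(text + "\n")
```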
- This step is for labeling the entities using Doccano; if you already have labeled data, you can skip this step and go directly to training the model.
- The first step is to install Doccano; please follow the Doccano instructions and open the program.
- For Windows installation:

```
pip install doccano
doccano
```
- Go to http://127.0.0.1:8000/
- Log in with username `admin` and password `password`.
- The first step is to install SpaCy:
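A minimal installation command, matching the version in the library list above:

```
pip install spacy==3.0.5
```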
- First, we read the JSONL file:
```python
import json

labeled_data = []
with open(r"project_1_dataset_v4.jsonl", "r", encoding="utf-8") as read_file:
    for line in read_file:
        data = json.loads(line)
        labeled_data.append(data)
print(labeled_data)
```
- After reading our data, we need to convert it to SpaCy's training format:
```python
TRAINING_DATA = []
for entry in labeled_data:
    entities = []
    for e in entry["labels"]:
        entities.append((e[0], e[1], e[2]))  # (start offset, end offset, label)
    spacy_entry = (entry["text"], {"entities": entities})
    TRAINING_DATA.append(spacy_entry)
print(TRAINING_DATA)
```
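After this conversion, each entry in TRAINING_DATA is a (text, annotations) tuple in SpaCy's training format, roughly like this (values invented for illustration):

```python
("The case is 3D printed from PLA.",
 {"entities": [(12, 22, "PROCESS"), (28, 31, "MATERIAL")]})
```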
- The next step is to train the model. We use deep learning (a neural network) and set a dropout rate of 0.3 to prevent overfitting.
```python
import spacy
import random
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Go through all annotations and register each entity label with the NER component
for _, annotations in TRAINING_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# Start the training
nlp.begin_training()

# Loop for 40 iterations
for itn in range(40):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        for text, annotations in batch:
            # Create an Example object from the text and its annotations
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            # Update the model with a dropout rate of 0.3
            nlp.update([example], losses=losses, drop=0.3)
    print(losses)
```
- While training, you may see some warnings at first; then the iterations should start and take a few minutes.
- After the iterations stop, we save the model:

```python
nlp.to_disk("./my.model")
```
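The saved model can later be loaded back without retraining; spacy.load reads the directory written by to_disk:

```python
import spacy

nlp = spacy.load("./my.model")
```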
- Testing the model:

```python
from spacy import displacy

example = "an example test"
doc = nlp(example)
displacy.render(doc, style="ent")
```
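displacy.render displays best inside a Jupyter notebook; in a plain script you can instead inspect the detected entities directly:

```python
for ent in doc.ents:
    print(ent.text, ent.label_)
```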
- The algorithm can be used in different use cases with corresponding training data.