Skip to content

An open source no-code system for text annotation and building text classifiers

License

Notifications You must be signed in to change notification settings

ygyg70/label-sleuth

 
 

Repository files navigation

version license python python test react test Slack

Label Sleuth

Label Sleuth is an open source no-code system for text annotation and building text classifers. With Label Sleuth, domain experts (e.g., physicians, lawyers, psychologists) can quickly create custom NLP models by themselves, with no dependency on NLP experts.

Creating real-world NLP models typically requires a combination of two expertise - deep knowledge of the target domain, provided by domain experts, and machine learning knowledge, provided by NLP experts. Thus, domain experts are dependent on NLP experts. Label Sleuth comes to eliminate this dependency. With an intuitive UX, it escorts domain experts in the process of labeling the data and building NLP models which are tailored to their specific needs. As domain experts label examples within the system, machine learning models are being automatically trained in the background, make predictions on new examples, and provide suggestions for the users on the examples they should label next.

Label Sleuth is a no-code system, no knowledge in machine learning is need, and - it is fast to obtain a model – from task definition to a working model in just a few hours!

Table of contents

Installation for end users

Setting up a development environment

Project structure

Using the system

Customizing the system

Reference

Installation for end users (non-developers)

Follow the instructions on our website.

Setting up a development environment

The system requires Python 3.8 or 3.9 (other versions are currently not supported and may cause issues).

  1. Clone the repository:

    git clone [email protected]:label-sleuth/label-sleuth.git

  2. cd to the cloned directory: cd label-sleuth

  3. Install the project dependencies using conda (recommended) or pip:

Installing with conda

# Create and activate a virtual environment:
conda create --yes -n label-sleuth python=3.9
conda activate label-sleuth
# Install requirements
pip install -r requirements.txt

Installing with pip

Assuming python 3.8/3.9 is already installed.

pip install -r requirements.txt

  1. Start the Label Sleuth server: run python -m label_sleuth.start_label_sleuth.

    By default all project files are written to <home_directory>/label-sleuth, to change the directory add --output_path <your_output_path>.

    You can add --load_sample_corpus wiki_animals_2000_pages to load a sample corpus into the system at startup. This fetches a collection of Wikipedia documents from the data-examples repository.

    By default, the host will be localhost to expose the server only on the host machine. If you wish to expose the server to external communication, add --host <IP> for example, --host 0.0.0.0 to listen to all IPs.

    Default port is 8000, to change the port add --port <port_number> to the command.

    The system can then be accessed by browsing to http://localhost:8000 (or http://localhost:<port_number>)

Project Structure

The repository consists of a backend library, written in Python, and a frontend that uses React. A compiled version of the frontend can be found under label_sleuth/build.

Using the system

See our website for a simple tutorial that illustrates how to use the system with a sample dataset of Wikipedia pages. Before starting the tutorial, make sure you pre-load the sample dataset by running:

python -m label_sleuth.start_label_sleuth --load_sample_corpus wiki_animals_2000_pages.

Customizing the system

System configuration

The configurable parameters of the system are specified in a json file. The default configuration file is label_sleuth/config.json.

A custom configuration can be applied by passing the --config_path parameter to the "start_label_sleuth" command, e.g., python -m label_sleuth.start_label_sleuth --config_path <path_to_my_configuration_json>

Configurable parameters:

Parameter Description
first_model_positive_threshold Number of elements that must be assigned a positive label for the category in order to trigger the training of a classification model.

See also: The training invocation documentation.
changed_element_threshold Number of changes in user labels for the category -- relative to the last trained model -- that are required to trigger the training of a new model. A change can be a assigning a label (positive or negative) to an element, or changing an existing label. Note that first_model_positive_threshold must also be met for the training to be triggered.

See also: The training invocation documentation.
training_set_selection_strategy Strategy to be used from TrainingSetSelectionStrategy. A TrainingSetSelectionStrategy determines which examples will be sent in practice to the classification models at training time - these will not necessarily be identical to the set of elements labeled by the user. For currently supported implementations see get_training_set_selector().

See also: The training set selection documentation.
model_policy Policy to be used from ModelPolicies. A ModelPolicy determines which type of classification model(s) will be used, and when (e.g. always / only after a specific number of iterations / etc.).

See also: The model selection documentation.
active_learning_strategy Strategy to be used from ActiveLearningStrategies. An ActiveLearner module implements the strategy for recommending the next elements to be labeled by the user, aiming to increase the efficiency of the annotation process. For currently supported implementations see get_active_learner().

See also: The active learning documentation.
precision_evaluation_size Sample size to be used for estimating the precision of the current model. To be used in future versions of the system, which will provide built-in evaluation capabilities.
apply_labels_to_duplicate_texts Specifies how to treat elements with identical texts. If true, assigning a label to an element will also assign the same label to other elements which share the exact same text; if false, the label will only be assigned to the specific element labeled by the user.
language Specifies the chosen system-wide language. This determines some language-specific resources that will be used by models and helper functions (e.g., stop words). The list of supported languages can be found in Languages. We welcome contributions of additional languages.
login_required Specifies whether or not using the system will require user authentication. If true, the configuration file must also include a users parameter
users Only relevant if login_required is true. Specifies the pre-defined login information in the following format:
"users":[
 {
   "username": "<predefined_username1>",
   "token":"<randomly_generated_token1>",
   "password":"<predefined_user1_password>"
 }
]
* The list of usernames is static and currently all users have access to all the workspaces in the system.

Implementing new components

Label Sleuth is a modular system. We welcome the contribution of additional implementations for the various modules, aiming to support a wider range of user needs and to harness efficient and innovative machine learning algorithms.

Below are instructions for implementing new models and active learning strategies:

Implementing a new machine learning model

These are the steps for integrating a new classification model:

  1. Implement a new ModelAPI

Machine learning models are integrated by adding a new implementation of the ModelAPI.

The main functions are _train(), load_model() and infer():

def _train(self, model_id: str, train_data: Sequence[Mapping], model_params: Mapping):
  • model_id
  • train_data - a list of dictionaries with at least the "text" and "label" fields. Additional fields can be passed e.g. [{'text': 'text1', 'label': 1, 'additional_field': 'value1'}, {'text': 'text2', 'label': 0, 'additional_field': 'value2'}]
  • model_params - dictionary for additional model parameters (can be None)
def load_model(self, model_path: str):
  • model_path: path to a directory containing all model components

Returns an object that contains all the components that are necessary to perform inference (e.g., the trained model itself, the language recognized by the model, a trained vectorizer/tokenizer etc.).

def infer(self, model_components, items_to_infer) -> Sequence[Prediction]:
  • model_components: the return value of load_model(), i.e. an object containing all the components that are necessary to perform inference
  • items_to_infer: a list of dictionaries with at least the "text" field. Additional fields can be passed, e.g. [{'text': 'text1', 'additional_field': 'value1'}, {'text': 'text2', 'additional_field': 'value2'}]

Returns a list of Prediction objects - one for each item in items_to_infer - where Prediction.label is a boolean and Prediction.score is a float in the range [0-1]. Additional outputs can be passed by inheriting from the base Prediction class and overriding the get_predictions_class() method.

  1. Add the newly implemented ModelAPI to ModelsCatalog

  2. Add one or more policies that use the new model to ModelPolicies

Implementing a new active learning strategy

These are the steps for integrating a new active learning approach:

  1. Implement a new ActiveLearner

Active learning modules are integrated by adding a new implementation of the ActiveLearner API. The function to implement is get_per_element_score:

 def get_per_element_score(self, candidate_text_elements: Sequence[TextElement],
                           candidate_text_element_predictions: Sequence[Prediction], workspace_id: str,
                           dataset_name: str, category_name: str) -> Sequence[float]:    

Given sequences of text elements and the model predictions for these elements, this function returns an active learning score for each element. The elements with the highest scores will be recommended for the user to label next.

  1. Add the newly implemented ActiveLearner to the ActiveLearningCatalog

Reference

Eyal Shnarch, Alon Halfon, Ariel Gera, Marina Danilevsky, Yannis Katsis, Leshem Choshen, Martin Santillan Cooper, Dina Epelboim, Zheng Zhang, Dakuo Wang, Lucy Yip, Liat Ein-Dor, Lena Dankin, Ilya Shnayderman, Ranit Aharonov, Yunyao Li, Naftali Liberman, Philip Levin Slesarev, Gwilym Newton, Shila Ofek-Koifman, Noam Slonim and Yoav Katz (EMNLP 2022). Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours.

Please cite:

@inproceedings{shnarch2022labelsleuth,
	title={{L}abel {S}leuth: From Unlabeled Text to a Classifier in a Few Hours},
	author={Shnarch, Eyal and Halfon, Alon and Gera, Ariel and Danilevsky, Marina and Katsis, Yannis and Choshen, Leshem and Cooper, Martin Santillan and Epelboim, Dina and Zhang, Zheng and Wang, Dakuo and Yip, Lucy and Ein-Dor, Liat and Dankin, Lena and Shnayderman, Ilya and Aharonov, Ranit and Li, Yunyao and Liberman, Naftali and Slesarev, Philip Levin and Newton, Gwilym and Ofek-Koifman, Shila and Slonim, Noam and Katz, Yoav},
	booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing ({EMNLP}): System Demonstrations},
	publisher={Association for Computational Linguistics},
  	url={https://arxiv.org/abs/2208.01483},
  	year={2022}
}

About

An open source no-code system for text annotation and building text classifiers

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 52.3%
  • JavaScript 42.5%
  • CSS 4.8%
  • Other 0.4%