Rubrix is an open-source Python framework to label, refine and monitor data for NLP



👩🏾‍💻 Join the community on Slack 〰️ 📚 Docs 〰️ 🚀 Get started 〰️ 🔗 Quick links


Example: Named Entity Recognition data exploration and annotation with spaCy and the IMDB dataset

What is Rubrix?

Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Why Rubrix?

  • Open: Rubrix is free, open-source, and 100% compatible with major NLP libraries (Hugging Face transformers, spaCy, Stanford Stanza, Flair, etc.). In fact, you can use and combine your preferred libraries without implementing any specific interface.

  • End-to-end: Most annotation tools treat data collection as a one-off activity at the beginning of each project. In real-world projects, data collection is a key activity of the iterative process of ML model development. Once a model goes into production, you want to monitor and analyze its predictions, and collect more data to improve your model over time. Rubrix is designed to close this gap, enabling you to iterate as much as you need.

  • User and Developer Experience: The key to sustainable NLP solutions is to make it easier for everyone to contribute to projects. Domain experts should feel comfortable interpreting and annotating data. Data scientists should feel free to experiment and iterate. Engineers should feel in control of data pipelines. Rubrix optimizes the experience for these core users to make your teams more productive.

  • Beyond hand-labeling: Classical hand-labeling workflows are costly and inefficient, but keeping humans in the loop is essential. Rubrix lets you combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak supervision in novel data annotation workflows.

Features

Advanced NLP labeling

  • Programmatic labeling using weak supervision, with built-in label models (Snorkel, FlyingSquid)
  • Bulk-labeling and search-driven annotation
  • Iterate on training data with any pre-trained model or library
  • Efficiently review and refine annotations in the UI and with Python
  • Use Rubrix's built-in metrics and methods for finding label and data errors (e.g., cleanlab)
  • Simple integration with active learning workflows

Monitoring

  • Close the gap between production data and data collection activities
  • Auto-monitoring for major NLP libraries and pipelines (spaCy, Hugging Face, FlairNLP)
  • ASGI middleware for HTTP endpoints
  • Rubrix Metrics to understand data and model issues, like entity consistency for NER models
  • Integrated with Kibana for custom dashboards

Team workspaces

  • Bring different users and roles into the NLP data and model lifecycles
  • Organize data collection, review and monitoring into different workspaces
  • Manage workspace access for different users

Get started

To get started you need to follow three steps:

  1. Install the Python client
  2. Launch the web app
  3. Start logging data

🆕 Rubrix Cloud Beta: Use Rubrix on a scalable cloud infrastructure without installing the server. Join the waiting list

1. Install the Python client

You can install the Python client with pip or conda.

with pip

pip install rubrix

with conda

conda install -c conda-forge rubrix

2. Launch the web app

There are two ways to launch the web app:

  • a) Using docker-compose (recommended)
  • b) Executing the server code manually

a) Using docker-compose (recommended)

Create a folder:

mkdir rubrix && cd rubrix

and launch the docker-contained web app with the following command:

wget -O docker-compose.yml https://git.io/rb-docker && docker-compose up

This is the recommended way because it automatically includes an Elasticsearch instance, Rubrix's main persistence layer.

b) Executing the server code manually

When executing the server code manually you need to provide an Elasticsearch instance yourself.

  1. First, install Elasticsearch (we recommend version 7.10) and launch an Elasticsearch instance. For macOS and Windows there are Homebrew formulae and an MSI package, respectively.
  2. Install the Python client together with its server dependencies:
pip install "rubrix[server]"
  3. Launch a local instance of the web app:
python -m rubrix.server

By default, the Rubrix server looks for your Elasticsearch endpoint at http://localhost:9200. You can change this by setting the ELASTICSEARCH environment variable before launching the server.
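Before launching the server manually, it can help to confirm that Elasticsearch answers at the endpoint the server will use. The following is a minimal sketch using only the Python standard library. The helper names `elasticsearch_endpoint` and `is_reachable` are hypothetical and not part of Rubrix, and the check assumes an Elasticsearch instance without authentication.

```python
import os
import urllib.error
import urllib.request

DEFAULT_ENDPOINT = "http://localhost:9200"  # default endpoint named in this README


def elasticsearch_endpoint() -> str:
    """Return the value of ELASTICSEARCH, or the documented default if unset."""
    return os.environ.get("ELASTICSEARCH", DEFAULT_ENDPOINT)


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


if __name__ == "__main__":
    endpoint = elasticsearch_endpoint()
    state = "reachable" if is_reachable(endpoint) else "not reachable"
    print(f"Elasticsearch endpoint {endpoint} is {state}")
```

If the endpoint is reported as not reachable, start Elasticsearch or set ELASTICSEARCH to the right URL in the same environment from which you run `python -m rubrix.server`.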

3. Start logging data

The following code logs one record into a dataset called example-dataset:

import rubrix as rb

rb.log(
    rb.TextClassificationRecord(inputs="My first Rubrix example"),
    name='example-dataset'
)

If you go to your Rubrix web app at http://localhost:6900/, you should see your first dataset. The default username and password are rubrix and 1234. You can also check the REST API docs at http://localhost:6900/api/docs.
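To log more than one record, you can pass a list to `rb.log()` instead of calling it once per record. The sketch below assumes a Rubrix server running at the default address and that `rb.TextClassificationRecord` accepts a `metadata` dictionary. The helper names `record_kwargs` and `log_texts` and the `source` metadata key are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List


def record_kwargs(text: str, source: str) -> Dict[str, Any]:
    """Build the keyword arguments for one rb.TextClassificationRecord."""
    return {"inputs": text, "metadata": {"source": source}}


def log_texts(texts: List[str], dataset: str, source: str = "readme-example") -> int:
    """Log all texts to `dataset` with a single rb.log() call; return the count."""
    import rubrix as rb  # imported lazily so record_kwargs works without Rubrix

    records = [rb.TextClassificationRecord(**record_kwargs(t, source)) for t in texts]
    rb.log(records, name=dataset)
    return len(records)


if __name__ == "__main__":
    texts = ["My first Rubrix example", "My second Rubrix example"]
    try:
        print(f"Logged {log_texts(texts, 'example-dataset')} records")
    except Exception as error:  # Rubrix not installed, or server not running
        print(f"Could not log records: {error}")
```

Metadata logged this way should then be available for filtering in the web app, which is useful when records come from several sources.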

Congratulations! You are ready to start working with Rubrix. You can continue reading a working example below.

To better understand what's possible, take a look at Rubrix's Cookbook.

Quick links

| Doc | Description |
|---|---|
| 🚶 First steps | New to Rubrix and want to get started? |
| 👩‍🏫 Concepts | Want to know more about Rubrix concepts? |
| 🛠️ Setup and install | How to configure and install Rubrix |
| 🗒️ Tasks | What can you use Rubrix for? |
| 📱 Web app reference | How to use the web app for data exploration and annotation |
| 🐍 Python client API | How to use the Python classes and methods |
| 👩‍🍳 Rubrix cookbook | How to use Rubrix with your favourite libraries (flair, stanza...) |
| 👋 Community forum | Ask questions, share feedback, ideas and suggestions |
| 🤗 Hugging Face tutorial | Using Hugging Face transformers with Rubrix for text classification |
| 💫 spaCy tutorial | Using spaCy with Rubrix for NER projects |
| 🐠 Weak supervision tutorial | How to leverage weak supervision with snorkel & Rubrix |
| 🤔 Active learning tutorial | How to use active learning with modAL & Rubrix |

Example

Let's see Rubrix in action with a quick example: bootstrapping data annotation with a zero-shot classifier.

Why:

  • Pre-trained language models with zero-shot capabilities can sometimes accelerate your data annotation tasks: you pre-annotate your corpus with a zero-shot model and then review its predictions.
  • The same workflow can be applied if there is a pre-trained "supervised" model that fits your categories but needs fine-tuning for your own use case. For example, fine-tuning a sentiment classifier for a very specific type of message.

Ingredients:

  • A zero-shot classifier from the 🤗 Hub: typeform/distilbert-base-uncased-mnli
  • A dataset containing news
  • A set of target categories: Business, Sports, etc.

What are we going to do:

  1. Make predictions and log them into a Rubrix dataset.
  2. Use the Rubrix web app to explore, filter, and annotate some examples.
  3. Load the annotated examples and create a training set, which you can then use to train a supervised classifier.

1. Predict and log

Let's load the zero-shot pipeline and the dataset (we are using the AGNews dataset for demonstration, but this could be your own dataset). Then, let's go over the dataset records and log them using rb.log(). This will create a Rubrix dataset, accessible from the web app.

from transformers import pipeline
from datasets import load_dataset
import rubrix as rb

model = pipeline('zero-shot-classification', model="typeform/distilbert-base-uncased-mnli")

dataset = load_dataset("ag_news", split='test[0:100]')

labels = ['World', 'Sports', 'Business', 'Sci/Tech']

for item in dataset:
    prediction = model(item['text'], labels)

    record = rb.TextClassificationRecord(
        inputs=item["text"],
        prediction=list(zip(prediction['labels'], prediction['scores']))
    )

    rb.log(record, name="news_zeroshot")
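The `prediction` argument in the loop above is a list of (label, score) pairs built from the pipeline output, which is a dictionary with `sequence`, `labels`, and `scores` keys. The sketch below isolates that conversion so it can be inspected without downloading the model. The helper names `to_prediction` and `top_label` are hypothetical, and the example output is hand-written with made-up scores, not a real model result.

```python
from typing import Dict, List, Tuple

Prediction = List[Tuple[str, float]]


def to_prediction(output: Dict[str, list]) -> Prediction:
    """Pair each label with its score, as done in the logging loop."""
    return list(zip(output["labels"], output["scores"]))


def top_label(prediction: Prediction) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    return max(prediction, key=lambda pair: pair[1])


# Hand-written stand-in for one zero-shot pipeline output (scores are made up).
example_output = {
    "sequence": "Stocks rally as markets reopen",
    "labels": ["Business", "World", "Sci/Tech", "Sports"],
    "scores": [0.75, 0.125, 0.0625, 0.0625],
}

prediction = to_prediction(example_output)
print(prediction)
print(top_label(prediction))
```

Keeping the full list of pairs, rather than only the top label, is what lets you later filter records by predicted label and score in the web app.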

2. Explore, Filter and Label

Now let's access our Rubrix dataset and start annotating data. Let's filter the records predicted as Business with high probability and use the bulk-labeling feature for labeling 15 records as Business:


3. Load and create a training set

After a few iterations of data annotation, we can load the Rubrix dataset and create a training set to train or fine-tune a supervised model.

import pandas as pd
import rubrix as rb

# load the Rubrix dataset as a pandas DataFrame
rb_df = rb.load(name='news_zeroshot')

# keep only the records that have been annotated (validated) in the web app
rb_df = rb_df[rb_df.status == "Validated"]

# select the text input and the annotated label
train_df = pd.DataFrame({
    "text": rb_df.inputs.transform(lambda r: r["text"]),
    "label": rb_df.annotation,
})
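Most training libraries expect integer class ids and a held-out validation set. As one possible next step, the sketch below maps labels to ids and makes a reproducible split using only the standard library. It works on plain lists, which you could obtain with `train_df["text"].tolist()` and `train_df["label"].tolist()`. The function names, the 20% validation fraction, and the seed are illustrative assumptions, not part of Rubrix.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, int]


def build_label_map(labels: List[str]) -> Dict[str, int]:
    """Map each distinct label to an integer id, in sorted label order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def split_examples(
    texts: List[str],
    labels: List[str],
    validation_fraction: float = 0.2,
    seed: int = 42,
) -> Tuple[List[Example], List[Example]]:
    """Shuffle (text, label_id) pairs and return (train, validation) lists."""
    label_map = build_label_map(labels)
    examples = [(text, label_map[label]) for text, label in zip(texts, labels)]
    random.Random(seed).shuffle(examples)  # seeded, so the split is reproducible
    n_validation = int(len(examples) * validation_fraction)
    return examples[n_validation:], examples[:n_validation]


# Toy data standing in for the annotated records.
texts = ["t1", "t2", "t3", "t4", "t5"]
labels = ["Business", "Sports", "Business", "World", "Sports"]

train, validation = split_examples(texts, labels)
print(build_label_map(labels))
print(len(train), len(validation))
```

With only a handful of annotated records per class, a stratified split may be a better choice than the plain shuffle shown here.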

Architecture

Rubrix's main components are:

  • Rubrix Python client: Python client to log, load, copy and delete Rubrix datasets.
  • Rubrix server: FastAPI REST service for reading and writing data.
  • Elasticsearch: The storage layer and search engine powering the API and the web app.
  • Rubrix web app: Easy-to-use web application for data exploration and annotation.

Community

Rubrix is a new open-source project, and we are eager to hear your thoughts, fix bugs, and help you get started. Feel free to use the Discussion forum or open an issue, and we'll be pleased to help out.
