Section 3 improvements
nataliaElv committed Nov 20, 2024
1 parent 4d47d89 commit 5b03fdd
Showing 1 changed file with 17 additions and 5 deletions.
22 changes: 17 additions & 5 deletions chapters/en/chapter10/3.mdx
@@ -2,6 +2,13 @@

Depending on the NLP task that you're working with and the specific use case or application, your data and the annotation task will look different. For this section of the course, we'll use [a dataset collecting news](https://huggingface.co/datasets/SetFit/ag_news) to complete two tasks: text classification of the topic of each text, and token classification to identify the named entities mentioned.

<iframe
src="https://huggingface.co/datasets/SetFit/ag_news/embed/viewer/default/train"
frameborder="0"
width="100%"
height="560px"
></iframe>

It is possible to import datasets from the Hub using the Argilla UI directly, but we'll be using the SDK to learn how we can make further edits to the data if needed.

## Configure your dataset
@@ -25,15 +32,17 @@ We can now think about the settings of our dataset in Argilla. These represent t
```python
data = load_dataset("SetFit/ag_news", split="train")
data.features
```

These are the features of our dataset:

```python out
{'text': Value(dtype='string', id=None),
'label': Value(dtype='int64', id=None),
'label_text': Value(dtype='string', id=None)}
```

It contains a `text` and also some initial labels for the text classification. We'll add those to our dataset settings together with a `spans` question for the named entities:

```python
settings = rg.Settings(
@@ -58,8 +67,9 @@ settings = rg.Settings(

Let's dive a bit deeper into what these settings mean. First, we've defined **fields**, which hold the information that we'll be annotating. In this case, we only have one field and it comes in the form of text, so we've chosen a `TextField`.

Then, we define **questions** that represent the tasks that we want to perform on our data:

- For the text classification task we've chosen a `LabelQuestion` and we used the unique values of the `label_text` column as our labels, to make sure that the question is compatible with the labels that already exist in the dataset.
- For the token classification task, we'll need a `SpanQuestion`. We've defined a set of labels that we'll be using for that task, plus the field on which we'll be drawing the spans.
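
As a rough illustration of the first point, the label list for the `LabelQuestion` can be derived from the unique values of `label_text`. The sketch below uses a few hypothetical rows in plain Python instead of the real dataset, and `unique_labels` is a helper made up for this example; with the `datasets` library, `data.unique("label_text")` returns the same kind of list.

```python
# Hypothetical sample rows shaped like the dataset's features
# (`text`, `label`, `label_text`); the values are made up for illustration.
rows = [
    {"text": "Stocks rally as markets open", "label": 2, "label_text": "Business"},
    {"text": "Team wins the championship", "label": 1, "label_text": "Sports"},
    {"text": "New chip doubles battery life", "label": 3, "label_text": "Sci/Tech"},
    {"text": "Shares slip after earnings", "label": 2, "label_text": "Business"},
]


def unique_labels(rows, column="label_text"):
    """Return the distinct values of `column`, in order of first appearance."""
    seen = {}
    for row in rows:
        # A dict keeps insertion order, so duplicates are dropped without reordering.
        seen.setdefault(row[column], None)
    return list(seen)


labels = unique_labels(rows)
print(labels)  # ['Business', 'Sports', 'Sci/Tech']
```

Reusing the existing values this way keeps the question's labels identical to the strings already stored in the dataset, so the existing labels can be matched to the question without any renaming.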

To learn more about all the available types of fields and questions and other advanced settings, like metadata and vectors, go to the [Argilla docs](https://docs.argilla.io/latest/how_to_guides/dataset/#define-dataset-settings).
@@ -83,4 +93,6 @@ The dataset now appears in our Argilla instance, but you will see that it's empt
dataset.records.log(data, mapping={"label_text": "label"})
```

In our mapping, we've specified that the `label_text` column in the dataset should be mapped to the question with the name `label`. In this way, we'll use the existing labels in the dataset as pre-annotations so we can annotate faster.
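
Conceptually, the mapping renames dataset columns to the names used in the dataset settings before each record is logged. The following plain-Python sketch only illustrates that idea: `apply_mapping` is a hypothetical helper, not part of the Argilla SDK, and the sample row is made up.

```python
def apply_mapping(row, mapping):
    """Return a copy of `row` with keys renamed according to `mapping`.

    Hypothetical helper for illustration only; it is not Argilla's implementation.
    Keys that do not appear in `mapping` keep their original name.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# Made-up row with the `text` and `label_text` columns; the integer `label`
# column of the real dataset is left out to keep the sketch simple.
row = {"text": "Team wins the championship", "label_text": "Sports"}

record = apply_mapping(row, {"label_text": "label"})
print(record)  # {'text': 'Team wins the championship', 'label': 'Sports'}
```

After the renaming, the `text` value lines up with the `text` field and the existing label lines up with the `label` question, which is what lets it show up as a pre-annotation.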

Now our dataset is ready to be annotated!
