Home
Hashtag segmentation is the task of automatically adding spaces between the words of a hashtag. Hashformers is the current state of the art for hashtag segmentation. Hashformers is also language-agnostic: you can use it to segment hashtags not just in English, but in any language with a model on the Hugging Face Model Hub.
This quick start guide covers everything most users will need to know about the library.
To begin, install hashformers using pip:
pip install hashformers
Once you have hashformers installed, you can load a hashtag segmenter using your preferred model. Here's an example using the GPT-2 model:
from hashformers import TransformerWordSegmenter as WordSegmenter
ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="gpt2"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)
# ['we need a national park', 'ice cold']
Hashformers utilizes the minicons library to load models and calculate the log-likelihood of hashtag segmentations. You can use any model that can be loaded through the minicons library.
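To make the scoring idea concrete, here is a minimal, self-contained sketch of likelihood-based segmentation: enumerate every way of inserting spaces into the hashtag and keep the highest-scoring candidate. The unigram log-probability table below is a toy stand-in; hashformers scores candidates with a real transformer through minicons instead.

```python
import itertools
import math

# Toy stand-in for a language model: unigram log-probabilities.
# Hashformers uses minicons to get such scores from a transformer;
# this word list is purely illustrative.
LOGPROBS = {
    "ice": math.log(0.05),
    "cold": math.log(0.04),
    "old": math.log(0.03),
}

def candidates(hashtag):
    """Yield every way of inserting spaces into the hashtag body."""
    body = hashtag.lstrip("#")
    for cuts in itertools.product([False, True], repeat=len(body) - 1):
        words, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                words.append(body[start:i])
                start = i
        words.append(body[start:])
        yield " ".join(words)

def score(segmentation):
    """Sum of unigram log-probabilities; unknown words are penalized."""
    return sum(LOGPROBS.get(w, math.log(1e-9)) for w in segmentation.split())

best = max(candidates("#icecold"), key=score)
print(best)  # ice cold
```

Real hashtags make this brute-force enumeration explode combinatorially, which is one reason to rely on a library that handles candidate generation and batched scoring for you.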
The following model types are available:
- gpt2: an alias kept for legacy purposes. It loads the IncrementalLMScorer in minicons.scorer.
- bert: an alias kept for legacy purposes. It loads the MaskedLMScorer in minicons.scorer.
- incremental: loads an IncrementalLMScorer.
- masked: loads a MaskedLMScorer.
- seq2seq: loads a Seq2SeqScorer.
For more information on the scorers, check the minicons repository.
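The list above amounts to a small lookup table. The sketch below restates it in Python for quick reference; the dict name is illustrative and not part of the hashformers API.

```python
# Mapping of hashformers model_type strings to minicons scorer
# class names, as described above. Illustrative only: this dict
# is not hashformers' internal code.
MODEL_TYPE_TO_SCORER = {
    "gpt2": "IncrementalLMScorer",         # legacy alias
    "bert": "MaskedLMScorer",              # legacy alias
    "incremental": "IncrementalLMScorer",
    "masked": "MaskedLMScorer",
    "seq2seq": "Seq2SeqScorer",
}

print(MODEL_TYPE_TO_SCORER["incremental"])  # IncrementalLMScorer
```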
Although any model can be used as the segmenter, our research shows that incremental models give good results in this role.
Suppose you want to segment hashtags in German. You can use a German GPT-2 model from the Hugging Face Model Hub:
ws = WordSegmenter(
    segmenter_model_name_or_path="dbmdz/german-gpt2",
    segmenter_model_type="incremental"
)
You can also use large language models (LLMs) for hashtag segmentation. Here we combine GPT-J and Dolly:
ws = WordSegmenter(
    segmenter_model_name_or_path="EleutherAI/gpt-j-6b",
    segmenter_model_type="incremental",
    reranker_model_name_or_path="databricks/dolly-v2-3b",
    reranker_model_type="incremental"
)
You can also call scorers by their class name directly. This can be useful if minicons implements a scorer that is not listed above. Here's an example:
# Achieving the same result using the class name
ws = WordSegmenter(
    segmenter_model_name_or_path="dbmdz/german-gpt2",
    segmenter_model_type="IncrementalLMScorer"
)
What we have covered so far will be sufficient for most projects using the hashformers library: simply load a model as the segmenter (we recommend using GPT-2 or some other incremental model) and start segmenting hashtags in your projects.
The remainder of this documentation covers use cases that may be useful for advanced users or developers who are seeking ways to contribute to the library.
- Frequently-Asked-Questions-(FAQ): frequently asked questions about the library.
- Segmenters: in-depth documentation of the TransformerWordSegmenter class and of the other segmenter classes available in the library for hashtag segmentation, such as TweetSegmenter and RegexWordSegmenter.
- Benchmarking: how to benchmark your segmenters, and benchmarks for some common datasets.