Comparison of Japanese Sentence Segmentation Tools

Requirements

Python: 3.10

Installation

pipenv install

Run

pipenv run python run.py

Benchmark

Data¹:
- wikipedia: 15 documents sampled from Japanese Wikipedia.
- cc100: 15 documents sampled from CC-100 (web text).
- emoji
  - Example input: "もちろん大丈夫です👍よろしくお願いします。"
  - Expected output: ["もちろん大丈夫です👍", "よろしくお願いします。"]
- kaomoji
  - Example input: "いいですよ^^よろしくお願いします。"
  - Expected output: ["いいですよ^^", "よろしくお願いします。"]
- named_entity
  - Example input: "モーニング娘。は日本のアイドルグループです。"
  - Expected output: ["モーニング娘。は日本のアイドルグループです。"]
- new_line
  - Example input: "時間は現在調整中ですので決まり次第\nご連絡差し上げます。"
  - Expected output: ["時間は現在調整中ですので決まり次第\nご連絡差し上げます。"]

Evaluation metric: F1 (micro average), in percent
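
The README does not spell out how the micro-averaged F1 is computed, so the following is only a minimal sketch of one common way to do it, not necessarily what `run.py` does. It assumes that a predicted sentence counts as correct only when its character span matches a gold sentence exactly, and that the sentences of a document concatenate back to the original text. The names `to_spans` and `micro_f1` are hypothetical.

```python
def to_spans(sentences: list[str]) -> set[tuple[int, int]]:
    """Convert consecutive sentences into (start, end) character offsets.

    Assumption: the sentences concatenate back to the original document.
    """
    spans: set[tuple[int, int]] = set()
    start = 0
    for sentence in sentences:
        end = start + len(sentence)
        spans.add((start, end))
        start = end
    return spans


def micro_f1(golds: list[list[str]], preds: list[list[str]]) -> float:
    """Micro-averaged F1: pool the counts over all documents, then compute F1 once."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(golds, preds):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        tp += len(gold_spans & pred_spans)  # sentences with exactly matching spans
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Gold segmentations taken from the examples in the Data list.
golds = [
    ["もちろん大丈夫です👍", "よろしくお願いします。"],
    ["モーニング娘。は日本のアイドルグループです。"],
]
# Hypothetical predictions: the first document is right, the second is over-split.
preds = [
    ["もちろん大丈夫です👍", "よろしくお願いします。"],
    ["モーニング娘。", "は日本のアイドルグループです。"],
]

score = micro_f1(golds, preds)
print(f"{100 * score:.1f}")  # → 57.1
```

Here 2 of 4 predicted sentences and 2 of 3 gold sentences match, so precision is 0.5, recall is 2/3, and F1 is 4/7. Pooling counts before computing F1 means documents with many sentences weigh more than short ones, which is the difference from a macro average.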

| Tool | Method | wikipedia | cc100 | emoji | kaomoji | named_entity | new_line |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pysbd | Rule-based | 100.0 | 85.5 | 0.0 | 0.0 | 0.0 | 44.4 |
| rhoknp | Rule-based | 100.0 | 88.4 | 0.0 | 0.0 | 0.0 | 44.4 |
| kuzukiri | Rule-based | 100.0 | 85.3 | 0.0 | 0.0 | 72.7 | 44.4 |
| hasami | Rule-based | 94.8 | 86.2 | 0.0 | 0.0 | 72.7 | 44.4 |
| sengiri | Rule-based | 55.7 | 68.1 | 12.9 | 0.0 | 56.0 | 44.4 |
| bunkai | Rule-based + Model-based | 93.7 | 83.7 | 100.0 | 66.7 | 0.0 | 100.0 |
| ginza (ja_ginza_electra) | Model-based | 95.7 | 85.7 | 66.7 | 84.2 | 75.0 | 70.0 |
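
To make the emoji, kaomoji, and named_entity categories more concrete, here is a toy punctuation-based splitter applied to the example inputs from the Data list. It is a hypothetical baseline written for illustration only (the name `naive_split` is made up) and does not reproduce the rules of any tool in the table.

```python
import re


def naive_split(text: str) -> list[str]:
    """Toy baseline: cut after every sentence-final 。, ！ or ？."""
    # The lookbehind matches the empty position right after the punctuation,
    # so the punctuation stays attached to the preceding sentence.
    parts = re.split(r"(?<=[。！？])", text)
    return [part for part in parts if part]


examples = {
    "emoji": "もちろん大丈夫です👍よろしくお願いします。",
    "kaomoji": "いいですよ^^よろしくお願いします。",
    "named_entity": "モーニング娘。は日本のアイドルグループです。",
    "new_line": "時間は現在調整中ですので決まり次第\nご連絡差し上げます。",
}

for category, text in examples.items():
    print(category, naive_split(text))
```

This baseline leaves the emoji and kaomoji examples as a single chunk, because no punctuation marks the boundary after 👍 or ^^, and it cuts the named_entity example in two at the 。 inside the group name. Both outcomes differ from the expected outputs listed under Data.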

¹ Annotation was done by the repository owner.
