hkiyomaru/ja-senter-benchmark


Comparison of Japanese Sentence Segmentation Tools

Requirements

  • Python: 3.10

Installation

pipenv install

Run

pipenv run python run.py

Benchmark

  • Data [1]:
    • wikipedia: 15 documents sampled from Japanese Wikipedia.
    • cc100: 15 documents sampled from CC-100 (web text).
    • emoji
      • Example input: "もちろん大丈夫です👍よろしくお願いします。"
      • Expected output: ["もちろん大丈夫です👍", "よろしくお願いします。"]
    • kaomoji
      • Example input: "いいですよ^^よろしくお願いします。"
      • Expected output: ["いいですよ^^", "よろしくお願いします。"]
    • named_entity
      • Example input: "モーニング娘。は日本のアイドルグループです。"
      • Expected output: ["モーニング娘。は日本のアイドルグループです。"]
    • new_line
      • Example input: "時間は現在調整中ですので決まり次第\nご連絡差し上げます。"
      • Expected output: ["時間は現在調整中ですので決まり次第\nご連絡差し上げます。"]
  • Evaluation metric: F1 (micro average)
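
The metric can be sketched as follows. This is a minimal illustration, not the code in run.py. It assumes that a predicted sentence counts as correct only when it exactly matches a gold sentence of the same document, and that counts are pooled over all documents before precision and recall are computed (micro averaging). The function name `micro_f1` and its arguments are hypothetical.

```python
from collections import Counter


def micro_f1(gold_docs, pred_docs):
    """Micro-averaged F1 over sentences (assumed exact string match).

    gold_docs and pred_docs are parallel lists; each element is the list
    of sentences for one document.
    """
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_docs, pred_docs):
        # Multiset intersection: each gold sentence can be matched once.
        tp += sum((Counter(gold) & Counter(pred)).values())
        n_gold += len(gold)
        n_pred += len(pred)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["もちろん大丈夫です👍", "よろしくお願いします。"]]
unsplit = [["もちろん大丈夫です👍よろしくお願いします。"]]
print(micro_f1(gold, gold))     # perfect segmentation
print(micro_f1(gold, unsplit))  # no predicted sentence matches a gold one
```

Under this assumption, a tool that leaves the emoji example unsplit gets no credit for that document, because neither gold sentence appears in its output.
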
| Tool | Method | wikipedia | cc100 | emoji | kaomoji | named_entity | new_line |
|---|---|---|---|---|---|---|---|
| pysbd | Rule-based | 100.0 | 85.5 | 0.0 | 0.0 | 0.0 | 44.4 |
| rhoknp | Rule-based | 100.0 | 88.4 | 0.0 | 0.0 | 0.0 | 44.4 |
| kuzukiri | Rule-based | 100.0 | 85.3 | 0.0 | 0.0 | 72.7 | 44.4 |
| hasami | Rule-based | 94.8 | 86.2 | 0.0 | 0.0 | 72.7 | 44.4 |
| sengiri | Rule-based | 55.7 | 68.1 | 12.9 | 0.0 | 56.0 | 44.4 |
| bunkai | Rule-based + Model-based | 93.7 | 83.7 | 100.0 | 66.7 | 0.0 | 100.0 |
| ginza (ja_ginza_electra) | Model-based | 95.7 | 85.7 | 66.7 | 84.2 | 75.0 | 70.0 |
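
To illustrate why the emoji and named_entity inputs are hard for punctuation-driven rules, here is a deliberately naive splitter. It is a hypothetical sketch and does not reproduce the rules of any tool listed above; `naive_split` is an illustrative name. It breaks after every 。, ！ or ？ and nowhere else.

```python
import re


def naive_split(text):
    """Split after each sentence-final punctuation mark (。！？).

    The lookbehind keeps the punctuation attached to the sentence;
    the empty string left after the final mark is dropped.
    Requires Python 3.7+, where re.split accepts zero-width matches.
    """
    return [s for s in re.split(r"(?<=[。！？])", text) if s]


# emoji: no punctuation after 👍, so the two sentences stay merged.
print(naive_split("もちろん大丈夫です👍よろしくお願いします。"))

# named_entity: the 。 inside the group name モーニング娘。 triggers a wrong split.
print(naive_split("モーニング娘。は日本のアイドルグループです。"))
```

The first call returns the input as a single sentence and the second returns two fragments, so neither output matches the expected segmentation given in the Benchmark section.
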

Footnotes

  1. The data was annotated by the repository owner.
