introduce mixed_error_rate (#113)
Co-authored-by: Aaron Ng <[email protected]>
aaronng91 and Aaron Ng authored Dec 18, 2024
1 parent 1bd6e7f commit 4b00140
Showing 3 changed files with 28 additions and 0 deletions.
1 change: 1 addition & 0 deletions asr_metrics/examples/hypothesis-mer.txt
@@ -0,0 +1 @@
Gary and Paul O'Donovan from Skibbereen. Olympic 金牌获得者. How does it feel? It's fantastic. 我们还没有太多鸡蛋来接受这一个事实. We wanted to win the 金牌 and come over with a 银牌. Like we're just so 开心. We can't complain with that. Paul Go 对我来说就是 as fast as you can and 拉 like a dog. 第四 at 500. 第四 at 1000. You 拉得非常用力 in that 最后阶段 of that race. We did. Josh 我们有点那么失望我们没能带回 金牌. I think we 把它交给 the French as best we could and we 冲到接近终点 there at the end. I'd say 我们整个车道都在跑, but we're 害怕回家了 now 因为 Mick Conlan 说他会打我们如果我们没能拿到 金牌.
1 change: 1 addition & 0 deletions asr_metrics/examples/reference-mer.txt
@@ -0,0 +1 @@
Gary and Paul O'Donovan from Skibbereen. Olympic 银牌获得者. How does it feel? It's fantastic. 我们还没有太多机会来接受这个事实. We wanted to win the 金牌 and 带回一块银牌, like we're just so 开心. We can't complain with that. Paul ​go ​从 ​A 到 ​B as fast as you can and 拉 like a dog. 第四 at 500. 第四 at 1000. You 在最后的阶段 拉得非常用力. We did. Josh 我们有一点失望没能带回 金牌. I think we 尽力把它交给 the French as best we could and we 冲到接近终点 there at the end. I'd say 我们整个车道都在跑, but we're 害怕回家了 now 因为 Mick Conlan 说他会打我们 if we 没能拿到 金牌 so.
26 changes: 26 additions & 0 deletions asr_metrics/wer/__main__.py
@@ -9,6 +9,7 @@
from collections import Counter
import argparse
import pandas as pd
import re

from jiwer import compute_measures, cer
from asr_metrics.wer.normalizers import BasicTextNormalizer, EnglishTextNormalizer
@@ -249,6 +250,17 @@ def check_paths(ref_path, hyp_path) -> Tuple[list[str], list[str]]:
raise ValueError("Unexpected file type. Please ensure files of the same type")


def add_space_between_cjk(input_str) -> str:
    # Regular expression to match CJK characters
    cjk_pattern = r"([\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff])"

    # Add a space before and after each CJK character
    spaced_str = re.sub(cjk_pattern, r" \1 ", input_str)

    # Replace multiple spaces with a single space and strip leading/trailing spaces
    return " ".join(spaced_str.split())
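
A minimal sketch of what this helper does to a mixed English/Chinese line, using a phrase taken from the example files. The function body mirrors the diff; the constant name `CJK_PATTERN` and the `sample` and `units` variables are illustrative only.

```python
import re

# Same character classes as cjk_pattern in the diff: CJK Unified Ideographs,
# Extension A, and CJK Compatibility Ideographs.
CJK_PATTERN = r"([\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff])"


def add_space_between_cjk(input_str: str) -> str:
    # Surround every CJK character with spaces, then collapse repeated whitespace.
    spaced = re.sub(CJK_PATTERN, r" \1 ", input_str)
    return " ".join(spaced.split())


sample = "We wanted to win the 金牌 and 带回一块银牌"
units = add_space_between_cjk(sample).split()
print(units)       # each English word and each Chinese character is one entry
print(len(units))  # 6 English words + 8 Chinese characters
```

Because only the listed ranges are matched, punctuation that touches a CJK character (for example the full stop in `金牌.`) ends up as a separate token unless an earlier normalization step has already removed it.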


def get_wer_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    parser.add_argument(
        "--non-en", help="Indicate the language is NOT english", action="store_true"
@@ -279,6 +291,14 @@ def get_wer_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        """,
        action="store_true",
    )
    parser.add_argument(
        "--mixed-error-rate",
        help="""
        Compute Mixed Error Rate instead of Word Error Rate.
        Each English word and Chinese character is treated as 1 unit.
        """,
        action="store_true",
    )
    parser.add_argument(
        "--keep-disfluencies",
        help="Retain disfluencies such as 'uhhh' and 'umm'. Disfluencies will be removed otherwise.",
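
A small sketch of how the new flag surfaces on the parsed namespace. It relies only on standard `argparse` behaviour: dashes in a long option become underscores in the attribute name, and `store_true` defaults to `False`. The parser here is a stand-in, not the project's full `get_wer_args`, and the program name is hypothetical.

```python
import argparse

# Stand-in parser with only the flag added by this commit.
parser = argparse.ArgumentParser(prog="wer-demo")  # hypothetical program name
parser.add_argument(
    "--mixed-error-rate",
    help="Compute Mixed Error Rate instead of Word Error Rate.",
    action="store_true",
)

with_flag = parser.parse_args(["--mixed-error-rate"])
without_flag = parser.parse_args([])

print(with_flag.mixed_error_rate)     # flag given on the command line
print(without_flag.mixed_error_rate)  # flag omitted, falls back to the default
```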
@@ -332,6 +352,9 @@ def main(args: Optional[argparse.Namespace] = None):
        if args.cer is True:
            differ, stats = run_cer(norm_ref, norm_hyp)
        else:
            if args.mixed_error_rate is True:
                norm_ref = add_space_between_cjk(norm_ref)
                norm_hyp = add_space_between_cjk(norm_hyp)
            differ, stats = run_wer(norm_ref, norm_hyp)

        stats["file name"] = hyp
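
The hunk above only pre-splits CJK characters and then reuses the existing word-level path, so the mixed error rate is an ordinary WER over the re-tokenized text. Note that the `args.cer` branch is checked first, so the new flag has no effect when CER is requested. `run_wer` is not shown in this diff, so the sketch below swaps in a plain Levenshtein distance over units instead of jiwer. The names `split_units`, `edit_distance`, and `mixed_error_rate` are hypothetical.

```python
import re

CJK_PATTERN = r"([\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff])"


def split_units(text: str) -> list[str]:
    """Split into units: whitespace-delimited words plus single CJK characters."""
    return re.sub(CJK_PATTERN, r" \1 ", text).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance (substitutions + deletions + insertions) over units."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference unit
                    curr[j - 1] + 1,     # insert a hypothesis unit
                    prev[j - 1] + cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def mixed_error_rate(reference: str, hypothesis: str) -> float:
    ref_units = split_units(reference)
    if not ref_units:
        raise ValueError("reference must contain at least one unit")
    return edit_distance(ref_units, split_units(hypothesis)) / len(ref_units)


reference = "Olympic 银牌获得者"
hypothesis = "Olympic 金牌获得者"

# Plain whitespace tokens: one of two "words" differs.
plain_wer = edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
# Mixed units: one of six units differs.
mer = mixed_error_rate(reference, hypothesis)
print(plain_wer, mer)
```

In this pair a single wrong character costs half the score under whitespace tokenization but only one sixth under mixed units, which is the motivation for treating each Chinese character as its own unit.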
@@ -355,6 +378,9 @@ def main(args: Optional[argparse.Namespace] = None):
            ignore_index=True,
        )

    if args.mixed_error_rate is True:
        results.rename(columns={"wer": "mixed_error_rate"}, inplace=True)

    print(results.to_markdown(index=False))

    if args.csv: