introduce mixed_error_rate

speechmatics · Dec 18, 2024 · f7be0ac
1 parent 1bd6e7f
Showing 3 changed files with 28 additions and 0 deletions.
diff --git a/asr_metrics/examples/hypothesis-mer.txt b/asr_metrics/examples/hypothesis-mer.txt
@@ -0,0 +1 @@
+Gary and Paul O'Donovan from Skibbereen. Olympic 金牌获得者. How does it feel? It's fantastic. 我们还没有太多鸡蛋来接受这一个事实. We wanted to win the 金牌 and come over with a 银牌. Like we're just so 开心. We can't complain with that. Paul Go 对我来说就是 as fast as you can and 拉 like a dog. 第四 at 500. 第四 at 1000. You 拉得非常用力 in that 最后阶段 of that race. We did. Josh 我们有点那么失望我们没能带回 金牌. I think we 把它交给 the French as best we could and we 冲到接近终点 there at the end. I'd say 我们整个车道都在跑, but we're 害怕回家了 now 因为 Mick Conlan 说他会打我们如果我们没能拿到 金牌.
diff --git a/asr_metrics/examples/reference-mer.txt b/asr_metrics/examples/reference-mer.txt
@@ -0,0 +1 @@
+Gary and Paul O'Donovan from Skibbereen. Olympic 银牌获得者. How does it feel? It's fantastic. 我们还没有太多机会来接受这个事实. We wanted to win the 金牌 and 带回一块银牌, like we're just so 开心. We can't complain with that. Paul go 从 A 到 B as fast as you can and 拉 like a dog. 第四 at 500. 第四 at 1000. You 在最后的阶段 拉得非常用力. We did. Josh 我们有一点失望没能带回 金牌. I think we 尽力把它交给 the French as best we could and we 冲到接近终点 there at the end. I'd say 我们整个车道都在跑, but we're 害怕回家了 now 因为 Mick Conlan 说他会打我们 if we 没能拿到 金牌 so.
diff --git a/asr_metrics/wer/__main__.py b/asr_metrics/wer/__main__.py
@@ -9,6 +9,7 @@
 from collections import Counter
 import argparse
 import pandas as pd
+import re
 
 from jiwer import compute_measures, cer
 from asr_metrics.wer.normalizers import BasicTextNormalizer, EnglishTextNormalizer
@@ -249,6 +250,17 @@ def check_paths(ref_path, hyp_path) -> Tuple[list[str], list[str]]:
     raise ValueError("Unexpected file type. Please ensure files of the same type")
 
 
+def add_space_between_cjk(input_str) -> str:
+    # Match Han ideographs: CJK Unified, Extension A, and Compatibility blocks
+    cjk_pattern = r'([\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff])'
+
+    # Add a space before and after each CJK character
+    spaced_str = re.sub(cjk_pattern, r' \1 ', input_str)
+
+    # Replace multiple spaces with a single space and strip leading/trailing spaces
+    return ' '.join(spaced_str.split())
+
+
 def get_wer_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
     parser.add_argument(
         "--non-en", help="Indicate the language is NOT english", action="store_true"
@@ -279,6 +291,14 @@ def get_wer_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
         """,
         action="store_true",
     )
+    parser.add_argument(
+        "--mixed-error-rate",
+        help="""
+        Compute Mixed Error Rate instead of Word Error Rate.
+        Each English word and Chinese character is treated as 1 unit.
+        """,
+        action="store_true",
+    )
     parser.add_argument(
         "--keep-disfluencies",
         help="Retain disfluencies such as 'uhhh' and 'umm'. Disfluencies will be removed otherwise.",
@@ -332,6 +352,9 @@ def main(args: Optional[argparse.Namespace] = None):
         if args.cer is True:
             differ, stats = run_cer(norm_ref, norm_hyp)
         else:
+            if args.mixed_error_rate is True:
+                norm_ref = add_space_between_cjk(norm_ref)
+                norm_hyp = add_space_between_cjk(norm_hyp)
             differ, stats = run_wer(norm_ref, norm_hyp)
 
         stats["file name"] = hyp
@@ -355,6 +378,9 @@ def main(args: Optional[argparse.Namespace] = None):
             ignore_index=True,
         )
 
+    if args.mixed_error_rate is True:
+        results.rename(columns={"wer": "mixed_error_rate"}, inplace=True)
+
     print(results.to_markdown(index=False))
 
     if args.csv:
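
The effect of `--mixed-error-rate` can be seen on a fragment of the example files. The sketch below copies `add_space_between_cjk` from this commit and pairs it with a hypothetical `token_error_rate` helper, a plain Levenshtein distance over whitespace tokens. That helper is an assumption made for illustration only; the repository's `run_wer` is built on `jiwer` and may normalise or count differently.

```python
import re

# Same ranges as in the commit: CJK Unified, Extension A, Compatibility Ideographs.
CJK_PATTERN = r'([\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff])'


def add_space_between_cjk(input_str: str) -> str:
    """Copy of the helper added in this commit."""
    spaced_str = re.sub(CJK_PATTERN, r' \1 ', input_str)
    return ' '.join(spaced_str.split())


def token_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical stand-in for run_wer: edit distance over tokens / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # Classic single-row Levenshtein: prev[j] is the distance between
    # the reference prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Lower-cased fragment of the reference/hypothesis example files.
ref = "olympic 银牌获得者"
hyp = "olympic 金牌获得者"

spaced_ref = add_space_between_cjk(ref)
spaced_hyp = add_space_between_cjk(hyp)
print(spaced_ref)  # olympic 银 牌 获 得 者

word_level = token_error_rate(ref, hyp)          # 1 error over 2 tokens
mixed = token_error_rate(spaced_ref, spaced_hyp)  # 1 error over 6 units
print(f"word-level: {word_level:.3f}, mixed: {mixed:.3f}")
```

Without the spacing step the whole run `银牌获得者` counts as one wrong word (1 of 2 tokens). With it, only the single substituted character `银` → `金` is penalised (1 of 6 units). Note that the pattern covers Han ideographs only, so kana or Hangul text would still be scored per whitespace-delimited token.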