Commit

Fix metrics submodules, enforce utf-8 encoding
Dan Cochrane committed Dec 13, 2023
1 parent 848f249 commit bee43e1
Showing 5 changed files with 15 additions and 7 deletions.
2 changes: 1 addition & 1 deletion metrics/README.md
@@ -6,7 +6,7 @@ We provide some additional tooling to help benchmark transcription and diarizati

### CLI

-The `sm-metrics` binary is built after installing with PyPI or running `python3 setup.py` from the source code. To see the options from the command-line, use the following:
+The `sm-metrics` binary is built after installing with PyPI or running `python3 setup.py install` from the source code. To see the options from the command-line, use the following:
``` bash
sm-metrics -h
```
2 changes: 1 addition & 1 deletion metrics/cli.py
@@ -10,7 +10,7 @@ def main():

     # Create subparsers
     subparsers = parser.add_subparsers(
-        dest="mode", help="Metrics mode. Choose from 'wer' or 'diarization"
+        dest="mode", help="Metrics mode. Choose from 'wer' or 'diarization'"
     )
     subparsers.required = True  # Make sure a subparser id always provided
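
The required-subcommand pattern touched by this hunk can be sketched in isolation as follows. This is a minimal illustration, assuming only the `dest="mode"` name and the two mode names from the help string; `build_parser` and the `prog` value are hypothetical and not taken from `metrics/cli.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper; the real cli.py builds its parser inside main().
    parser = argparse.ArgumentParser(prog="sm-metrics")
    subparsers = parser.add_subparsers(
        dest="mode", help="Metrics mode. Choose from 'wer' or 'diarization'"
    )
    # With required set, argparse exits with a usage error if no mode is given.
    subparsers.required = True
    subparsers.add_parser("wer")
    subparsers.add_parser("diarization")
    return parser


args = build_parser().parse_args(["wer"])
print(args.mode)
```

Setting `dest` matters here: it stores the chosen subcommand on the parsed namespace and gives argparse a name to report when the subcommand is missing.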

2 changes: 2 additions & 0 deletions metrics/wer/README.md
@@ -71,6 +71,8 @@ To see all the commands, run:
sm-metrics wer -h
```

+You must ensure that both the reference and hypothesis files are encoded in UTF-8.
+
## Read More

- [The Future of Word Error Rate](https://www.speechmatics.com/company/articles-and-news/the-future-of-word-error-rate?utm_source=facebook&utm_medium=social&fbclid=IwAR1z7ZU4WowgDBs91MNKFTwPACD9gb7dkrQpkr1HmfsgXPv-Ndt5PeySjIk&restored=1676632411598)
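
The UTF-8 requirement added to this README can be met by re-encoding a file before passing it to `sm-metrics wer`. The sketch below assumes the source encoding is already known (Windows-1252 is used purely as an example); `to_utf8` and the file names are hypothetical and not part of the repository.

```python
import tempfile
from pathlib import Path


def to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    # Hypothetical helper: decode with the known source encoding, then write UTF-8.
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "reference_cp1252.txt"  # example file name
    legacy.write_bytes("naïve café".encode("cp1252"))
    converted = Path(tmp) / "reference_utf8.txt"
    to_utf8(legacy, converted, "cp1252")
    print(converted.read_text(encoding="utf-8"))
```

If the source encoding is not known, guessing it is unreliable; it is safer to re-export the file as UTF-8 from the tool that produced it.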
9 changes: 7 additions & 2 deletions metrics/wer/__main__.py
@@ -22,8 +22,13 @@ def load_file(path: Path, file_type: str) -> str:


def load_text(path: Path) -> str:
-    with open(path, "r", encoding="utf-8") as input_path:
-        return input_path.read()
+    try:
+        with open(path, "r", encoding="utf-8") as input_path:
+            return input_path.read()
+    except UnicodeDecodeError as error:
+        raise ValueError(
+            f"Error reading file {path}: {error}. Ensure the file is UTF-8 encoded."
+        )


def load_sm_json(path: Path) -> str:
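
The new error path in `load_text` can be exercised with a file that is not valid UTF-8. The sketch below copies the function from this hunk with one change that is a suggestion rather than part of the commit: `from error` is added so the original `UnicodeDecodeError` stays attached as the cause. The temporary file name is hypothetical.

```python
import tempfile
from pathlib import Path


def load_text(path: Path) -> str:
    try:
        with open(path, "r", encoding="utf-8") as input_path:
            return input_path.read()
    except UnicodeDecodeError as error:
        # `from error` is an addition for illustration; the commit raises without it.
        raise ValueError(
            f"Error reading file {path}: {error}. Ensure the file is UTF-8 encoded."
        ) from error


with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "reference.txt"  # hypothetical file name
    bad.write_bytes("café".encode("latin-1"))  # the 0xE9 byte is not valid UTF-8 here
    try:
        load_text(bad)
    except ValueError as exc:
        print(exc)
```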
7 changes: 4 additions & 3 deletions setup.py
@@ -6,7 +6,7 @@
import os
import logging

-from setuptools import setup
+from setuptools import setup, find_packages


def read(fname):
@@ -55,11 +55,12 @@ def get_version(fname):


 logging.basicConfig(level=logging.INFO)
-
+print(f"Packages to install: {find_packages(exclude=['tests'])}")
 setup(
     name="speechmatics-python",
     version=os.getenv("VERSION", get_version("VERSION")),
-    packages=["speechmatics", "metrics"],
+    packages=find_packages(exclude=["tests"]),
+    package_data={"metrics": ["wer/normalizers/english.yaml"]},
     url="https://github.com/speechmatics/speechmatics-python/",
     license="MIT",
     author="Speechmatics",
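
The switch away from a hard-coded `packages` list matters because setuptools installs only the packages named there, so subpackages such as `metrics.wer` have to be listed or discovered. The sketch below is a rough stdlib stand-in for that discovery, not the setuptools implementation: `discover_packages` is a hypothetical helper, and the directory layout is an assumption modelled on the paths in this diff.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def discover_packages(root: Path, exclude=()):
    """Approximate package discovery: every directory holding an __init__.py."""
    found = []
    for init in root.rglob("__init__.py"):
        name = ".".join(init.parent.relative_to(root).parts)
        # Exclude patterns are matched against the full dotted package name.
        if name and not any(fnmatchcase(name, pattern) for pattern in exclude):
            found.append(name)
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Assumed layout for illustration only.
    for package in ("speechmatics", "metrics", "metrics/wer", "tests"):
        (root / package).mkdir(parents=True)
        (root / package / "__init__.py").touch()
    print(discover_packages(root, exclude=["tests"]))
```

In this sketch, as in setuptools, the pattern `"tests"` matches only the top-level package name; nested test packages would need a pattern such as `"tests.*"` as well.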