Problem with soft-hyphen #18

Open
mbarnig opened this issue Jan 15, 2022 · 0 comments
mbarnig commented Jan 15, 2022

When training the Marylux-648-Corpus with the COQUI-STT model, the generated alphabet includes the character \xad, which probably disturbs the model. This Unicode code point (U+00AD) is a soft hyphen: it is normally invisible in a text editor and marks a position where a word may be broken across lines, in which case typesetting software renders a visible hyphen there.
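As a rough illustration of why such a character is easy to miss, the sketch below uses Python's standard unicodedata module to list invisible "format" characters (Unicode category Cf, which includes the soft hyphen) in a string. The helper name invisible_format_chars and the sample string are made up for this example and are not part of the corpus or of COQUI-STT.

```python
import unicodedata

def invisible_format_chars(text):
    """Return the sorted set of Unicode format (category Cf) characters in text."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Cf"})

# Made-up sample string containing one soft hyphen (U+00AD).
sample = "dausend\xadan-eng"
found = invisible_format_chars(sample)
names = [unicodedata.name(ch) for ch in found]
print([f"U+{ord(ch):04X}" for ch in found], names)
```

Checking by category rather than for \xad alone would also surface other invisible characters, such as zero-width joiners, if any were present.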

To find these soft hyphens in the Marylux transcriptions, I created the following small Python script:

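import csv

filename = "/home/mbarnig/DOCKER/clips/stt-marylux-text.csv"
# Open explicitly as UTF-8; newline='' is what the csv module expects.
with open(filename, 'r', encoding='utf-8', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        # '\xad' is U+00AD, the soft hyphen; row[2] is the column being searched.
        if '\xad' in row[2]:
            print(row[2])
# No explicit close() is needed: the with statement closes the file.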

There is only one sentence in the Marylux dataset containing such a soft hyphen: lb-wiki-0024.

marylux_lb-wiki-0024|­D'Welt huet dausend-an-eng Steckdousen (E puer Gedankespréng iwwert d'Liesen) Concours littéraire national pour essais littéraires mille-neuf-cent-quatre-vingt-dix-huit.|­d'welt huet dausend-an-eng steckdousen (e puer gedankespréng iwwert d'liesen) concours littéraire national pour essais littéraires mille-neuf-cent-quatre-vingt-dix-huit.

By adding the line row[2] = row[2].replace('\xad', '') before the print statement in the above script, I removed the soft hyphens in this sample.
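Note that a replacement made inside the reading loop only changes the row in memory. A minimal sketch of how the cleaned rows could be written back out is shown below; the function name strip_soft_hyphens, the default column index 2 (mirroring the script above) and the demonstration rows are assumptions for illustration, not the actual corpus contents.

```python
import csv
import io

SOFT_HYPHEN = "\xad"  # U+00AD

def strip_soft_hyphens(reader, writer, column=2):
    """Copy rows from reader to writer, removing soft hyphens from one column.

    Returns the number of rows that were changed.
    """
    changed = 0
    for row in reader:
        if len(row) > column and SOFT_HYPHEN in row[column]:
            row[column] = row[column].replace(SOFT_HYPHEN, "")
            changed += 1
        writer.writerow(row)
    return changed

# In-memory demonstration with made-up rows (not the real corpus).
source = io.StringIO("a.wav,1,\xadhello world\nb.wav,2,plain text\n", newline="")
target = io.StringIO(newline="")
count = strip_soft_hyphens(
    csv.reader(source),
    csv.writer(target, lineterminator="\n"),
)
cleaned = target.getvalue()
print(count, repr(cleaned))
```

With real files, the same function could be called with a csv.reader over the input file and a csv.writer over a separate output file, both opened with encoding='utf-8' and newline='', so that the original CSV stays untouched until the result has been checked.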

@mbarnig mbarnig self-assigned this Jan 15, 2022