When training the COQUI-STT model on the Marylux-648-Corpus, the generated alphabet includes the character \xad, which probably disturbs the model. This Unicode code point (U+00AD) is a soft hyphen: it is invisible in most text editors and marks a position where typesetting software may break a word across lines, showing a visible hyphen only at the break.
To find these soft hyphens in the Marylux transcriptions, I created the following small Python script:
There is only one sentence in the Marylux-dataset containing such a soft hyphen: lb-wiki-0024.
marylux_lb-wiki-0024|D'Welt huet dausend-an-eng Steckdousen (E puer Gedankespréng iwwert d'Liesen) Concours littéraire national pour essais littéraires mille-neuf-cent-quatre-vingt-dix-huit.|d'welt huet dausend-an-eng steckdousen (e puer gedankespréng iwwert d'liesen) concours littéraire national pour essais littéraires mille-neuf-cent-quatre-vingt-dix-huit.
By adding the code row[2] = row[2].replace('\xad', '') before the print statement in the above script, I removed the soft hyphens in this sample.
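For readers who want to repeat the check on their own copy of the data, here is a minimal sketch of the same idea. It is not the script referred to above. It assumes the metadata is pipe-delimited with the sample ID in the first column, as in the lb-wiki-0024 line shown earlier. The function names find_soft_hyphens and strip_soft_hyphens and the demo rows are made up for illustration.

```python
import csv

SOFT_HYPHEN = "\xad"  # U+00AD, invisible in most editors


def find_soft_hyphens(lines):
    """Return the IDs (first column) of rows in which any field contains a soft hyphen."""
    # QUOTE_NONE: treat quote characters in the transcriptions as ordinary text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    return [row[0] for row in reader if any(SOFT_HYPHEN in field for field in row)]


def strip_soft_hyphens(lines):
    """Return all rows with soft hyphens removed from every field."""
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    return [[field.replace(SOFT_HYPHEN, "") for field in row] for row in reader]


if __name__ == "__main__":
    # Hypothetical demo rows, not taken from the corpus.
    demo = [
        "demo_0001|Steck\xaddousen|steck\xaddousen",
        "demo_0002|Welt|welt",
    ]
    print(find_soft_hyphens(demo))
    for row in strip_soft_hyphens(demo):
        print("|".join(row))
```

Both functions accept any iterable of lines, so a real metadata file can be passed directly as a file object opened with encoding="utf-8" and newline="". Stripping every field rather than only row[2] is a design choice of this sketch, so that the original and the normalized transcription stay consistent.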