Further support for French #78
base: master
Conversation
When the sequitur model is used, transcripts should be split with a language-specific tokenizer.
Resulting models are available here:
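For illustration only (this is not the code in this PR), a language-specific split could be done with NLTK's pretrained French sentence tokenizer; the file names below are assumptions:

```python
# Hypothetical sketch: split raw French transcripts into sentences with a
# language-specific tokenizer before the sequitur g2p step.
# Assumes NLTK is installed; input/output paths are examples only.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt', quiet=True)  # fetch the pretrained sentence tokenizer once

def split_french_transcripts(in_path, out_path):
    with open(in_path, encoding='utf-8') as f:
        text = f.read()
    # 'french' selects the punkt model trained on French, which handles
    # abbreviations such as "M." better than a plain split on ". "
    sentences = sent_tokenize(text, language='french')
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sentences) + '\n')

split_french_transcripts('transcripts_fr.txt', 'transcripts_fr_split.txt')
```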
Thanks for the great work pguyot! As mentioned in the other ticket, I have been trying to make the French models based on the zamia master branch. (I didn't see your most recent commits until last night :/)

Regarding the hanging import from est_republicain, this is what I meant: on my setup, after running `xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt`, the result is a 1.8 GB file without newlines. I didn't spend much time on it; I used `sed 's/. /.\n/g'` to add some newlines, and that seemed to do the trick for now.

I got around the quality issue by changing the audio scan script to add quality 2 and make ts a lowercase version of the prompt.

The .ipa I was referring to is the French dictionary that comes with Zamia; it is missing pronunciations for these words: hypnoanalgésies, ibogaïne, ibogaïnes, malabéen, malabéenne, malabéennes, malabéens, patagonien, patagonienne, patagoniennes, patagoniens, sulfamidé, sulfamidée, sulfamidées, sulfamidés, théophanique, théophaniques, xavière, xavières. For now I just deleted those lines, as I'm first trying to get it to actually train before doing things more properly. I just saw you committed some sources a week ago that I had not seen yet, and they include a newer dictionary with a different encoding; I will be sure to give that a try.

The next issue I encountered was the import_cv_fr script returning `uttid %s ist not unique!`. I'm not at that PC at the moment, but I believe I was able to patch it somehow to make it work, although my resulting file is not the same as yours.

I did not fix this yet: "create one dir per utt (since we have no speaker information)", as the cv_fr files do seem to contain the speaker information.

Are you generating those files manually, or do you use GenerateCorpora for that, or some other way to split into train / dev / test? I saw you also wrote import scripts for some more corpora; that was going to be my plan for the coming week, so you already saved me quite some time there! I may have some more text and audio corpora to add to the list.
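For what it's worth, a rough Python equivalent of that sed workaround, streaming the 1.8 GB file instead of loading it whole, could look like the sketch below (file names are illustrative, not part of the PR):

```python
# Streaming equivalent of the `sed 's/. /.\n/g'` workaround mentioned above,
# so est_republicain.txt never has to fit in memory at once.
import re

def add_sentence_breaks(in_path, out_path, chunk_size=1 << 20):
    carry = ''
    with open(in_path, encoding='utf-8', errors='replace') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            buf = carry + chunk
            # defer any trailing ". " candidate so a match never spans two chunks
            cut = re.search(r'[. ]*$', buf).start()
            dst.write(re.sub(r'\. +', '.\n', buf[:cut]))
            carry = buf[cut:]
        dst.write(re.sub(r'\. +', '.\n', carry))

add_sentence_breaks('est_republicain.txt', 'est_republicain_split.txt')
```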
Could you please suggest a precise revision of the
Umm. I realize I invoke

I believe the .ipa file in this pull request does not have these errors. Likewise, I fixed the spk_test.txt files in this pull request, which is why the WER increased compared to my previous attempt, yet the error rates on real-world sounds decreased significantly (it's not perfect, but it looks like it is recognizing something).

To generate the spk_test.txt files, I found out that Guenter had used a specific proportion of speakers (5% if I remember properly) and I picked them randomly.

Please be aware of the license of the audio and text corpora. The corpora I added are available under a CC-BY-NC-SA license, which is fine for me but represents a stronger constraint compared to Mozilla CV.
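As a purely illustrative sketch of that selection step (paths and the one-speaker-per-line format are assumptions, not the exact scripts used here), picking a random 5% of speakers for spk_test.txt could be done like this:

```python
# Pick a random fraction of speakers as the held-out test set.
import random

def pick_test_speakers(all_speakers_path, spk_test_path, fraction=0.05, seed=42):
    with open(all_speakers_path, encoding='utf-8') as f:
        speakers = sorted({line.strip() for line in f if line.strip()})
    random.Random(seed).shuffle(speakers)          # fixed seed keeps the split reproducible
    n_test = max(1, int(len(speakers) * fraction))
    with open(spk_test_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(speakers[:n_test])) + '\n')

pick_test_speakers('all_speakers.txt', 'spk_test.txt')
```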
Finally finished the tdnn_f model.
%WER 25.40 [ 35749 / 140755, 3408 ins, 11742 del, 20599 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_7_0.0
Downloadable from the same link.
Hello, I'm bumping this because I'm interested in building French models. I managed to build a small model using this branch and a subset of the dataset. Are there any outstanding issues preventing this from getting merged that I could look into?
Further support for French, which allowed me to build a reasonable model for French (tdnn_250; tdnn_f still being built).
%WER 30.17 [ 42464 / 140755, 3870 ins, 12822 del, 25772 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_7_0.0
The WER is quite high (higher than previously reported); however, this is probably because the previous model's WER was poorly computed (against a small set of test voices) and this new model is trained on a lot of noisy corpora.
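For reference, the Kaldi scoring line above breaks down as (insertions + deletions + substitutions) divided by the number of reference words: (3870 + 12822 + 25772) / 140755 = 42464 / 140755 ≈ 30.17 %.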