Further support for French #78

Open · pguyot wants to merge 9 commits into master

Conversation


pguyot commented Oct 1, 2019

Further support for French, which allowed me to build a reasonable model for French (tdnn_250; tdnn_f still being built).

%WER 30.17 [ 42464 / 140755, 3870 ins, 12822 del, 25772 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_7_0.0

The WER is quite high (higher than previously reported), but this is probably because the previous model's WER was poorly computed (against a small set of test voices) and this new model is trained on a lot of noisy corpora.
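
For context, the WER in the Kaldi scoring line above is simply (insertions + deletions + substitutions) divided by the number of reference words, which the bracketed counts confirm:

```python
# Sanity-check of the Kaldi scoring line:
# %WER 30.17 [ 42464 / 140755, 3870 ins, 12822 del, 25772 sub ]
ins, dels, subs = 3870, 12822, 25772
ref_words = 140755

errors = ins + dels + subs        # 42464
wer = 100.0 * errors / ref_words  # 30.17
print(f"%WER {wer:.2f} [ {errors} / {ref_words} ]")
```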


pguyot commented Oct 1, 2019

Resulting models are available here:
https://github.com/pguyot/zamia-speech/releases/tag/20190930


joazoa commented Oct 8, 2019

Thanks for the great work, pguyot!

As mentioned in the other ticket, I have been trying to build the French models based on the Zamia master branch. (I didn't see your most recent commits until last night. :/)

Regarding the hanging import from est_republicain, this is what I meant:
When I tried to run ./speech_sentences.py est_republicain, I noticed that the script seemed to hang (it doesn't actually hang, it just runs really slowly according to strace). I left it for a couple of hours before giving up.

On my setup, after running xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt, the result is a 1.8 GB file without newlines.

I didn't spend much time on it; I used sed 's/\. /.\n/g' to add some newlines, and that seemed to do the trick for now.
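
For anyone reproducing this, here is the same trick as a Python sketch (file names are placeholders; splitting on punctuation is just as naive as the sed one-liner and will break on abbreviations):

```python
# Sketch: insert newlines after sentence-ending punctuation in the flat
# est_republicain dump. Note: this reads the whole ~1.8 GB file into
# memory; chunked processing would be safer.
import re

with open("est_republicain.txt", encoding="utf-8") as f:
    text = f.read()

# Break after ".", "!" or "?" followed by spaces.
sentences = re.sub(r"([.!?]) +", r"\1\n", text)

with open("est_republicain_sentences.txt", "w", encoding="utf-8") as f:
    f.write(sentences)
```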

I got around the quality issue by changing the audio scan script to add quality 2 and make ts a lowercase version of the prompt.
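
In case it is useful, a hedged sketch of that kind of post-processing, assuming the scan script emits a transcripts CSV with prompt, ts and quality columns (the column and file names here are my assumptions, not the actual zamia-speech schema):

```python
# Hedged sketch: force quality to 2 and derive ts from the prompt in a
# transcripts CSV. Column names ("prompt", "ts", "quality") and file
# names are assumptions about the scan output, not the real schema.
import csv

with open("transcripts.csv", encoding="utf-8", newline="") as src:
    rows = list(csv.DictReader(src))

for row in rows:
    row["quality"] = "2"
    row["ts"] = row["prompt"].lower()

with open("transcripts.fixed.csv", "w", encoding="utf-8", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```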

The .ipa file I was referring to is the French dictionary that comes with Zamia; it is missing pronunciations for these words: hypnoanalgésies, ibogaïne, ibogaïnes, malabéen, malabéenne, malabéennes, malabéens, patagonien, patagonienne, patagoniennes, patagoniens, sulfamidé, sulfamidée, sulfamidées, sulfamidés, théophanique, théophaniques, xavière, xavières

For now I just deleted those lines, as I'm first trying to get it to actually train before doing things more properly. I just saw you committed some sources a week ago that I had not seen yet, including a newer dictionary with a different encoding; I will be sure to give that a try.
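
The stopgap was along these lines (a sketch, not the exact patch; file names are placeholders and the whole-word matching is naive):

```python
# Sketch of the stopgap: drop corpus sentences containing any word that
# has no pronunciation in the French .ipa dictionary, so training does
# not hit out-of-vocabulary prompts.
MISSING = {
    "hypnoanalgésies", "ibogaïne", "ibogaïnes", "malabéen", "malabéenne",
    "malabéennes", "malabéens", "patagonien", "patagonienne",
    "patagoniennes", "patagoniens", "sulfamidé", "sulfamidée",
    "sulfamidées", "sulfamidés", "théophanique", "théophaniques",
    "xavière", "xavières",
}

with open("sentences.txt", encoding="utf-8") as src, \
     open("sentences.filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # Naive whole-word check; punctuation glued to a word slips through.
        if not MISSING.intersection(line.lower().split()):
            dst.write(line)
```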

The next issue I encountered was the import_cv_fr script returning uttid %s ist not unique!. I'm not at that PC at the moment, but I believe I was able to patch it somehow to make it work, although my resulting file is not the same as yours.
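
For reference, one way such a patch could look (a sketch only; the names are illustrative, not the actual code in import_cv_fr.py):

```python
# Sketch: instead of aborting on "uttid %s ist not unique!", append a
# running counter to duplicate utterance ids so each one stays unique.
seen = {}

def unique_uttid(uttid):
    """Return uttid unchanged the first time, uttid-1, uttid-2, ... after."""
    if uttid not in seen:
        seen[uttid] = 0
        return uttid
    seen[uttid] += 1
    return f"{uttid}-{seen[uttid]}"
```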

I did not fix this yet: "create one dir per utt (since we have no speaker information)", as the cv_fr files do seem to contain speaker information.
My cv_fr, voxforge_fr and m_ailabs_fr spk_test.txt files are all empty, which I presume may also be the reason why I have no test samples when I start the training (and why the training fails).

Are you generating those files manually, or do you use GenerateCorpora or some other way to split into train / dev / test?

I saw you also wrote import scripts for some more corpora; that was going to be my plan for the coming week, so you already saved me quite some time there!

I may have some more text and audio corpora to add to the list.


pguyot commented Oct 10, 2019

I didn't spend much time on it; I used sed 's/\. /.\n/g' to add some newlines, and that seemed to do the trick for now.

Could you please suggest a precise revision of the README.md file to explain the sed trick? To be completely transparent, I did not actually use the xmllint line; I cooked it up as a replacement for a more complex combination of scripts and text editing that relied on a lot of third-party dependencies and seemed overkill.

I got around the quality issue by changing the audio scan script to add quality 2 and make ts a lowercase version of the prompt.

Umm. I realize I invoke speech_sentences.py with the -p option. Is it at this step that you need to add quality to the CSV transcripts? We could simply document the -p case in README.md.

I believe the .ipa file in this pull request does not have these errors.

Likewise, I fixed the spk_test.txt files in this pull request, which is why the WER increased compared to my previous attempt; yet the error rates on real-world audio decreased significantly (it's not perfect, but it looks like it is recognizing something).

To generate the spk_test.txt files, I found out that Guenter had used a specific proportion of test speakers (5% if I remember correctly), and I picked them randomly.
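
Concretely, the selection can be done with a few lines (a sketch; the speaker list file and output path are illustrative):

```python
# Sketch: pick ~5% of speakers at random and write them to spk_test.txt,
# one speaker id per line. The input file is illustrative; in practice
# the speaker list comes from the corpus metadata.
import random

with open("all_speakers.txt", encoding="utf-8") as f:
    speakers = [line.strip() for line in f if line.strip()]

random.seed(42)  # make the split reproducible
test_speakers = random.sample(speakers, max(1, round(0.05 * len(speakers))))

with open("spk_test.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(test_speakers)) + "\n")
```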

Please be aware of the licenses of the audio and text corpora. The corpora I added are available under a CC-BY-NC-SA license, which is fine for me but represents a stronger constraint than Mozilla CV's license.


pguyot commented Oct 16, 2019

Finally finished the tdnn_f model.

%WER 25.40 [ 35749 / 140755, 3408 ins, 11742 del, 20599 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_7_0.0

Downloadable from the same link.
https://github.com/pguyot/zamia-speech/releases/tag/20190930

joazoa mentioned this pull request Jan 21, 2020

a-rose commented Mar 16, 2021

Hello,

I'm bumping this because I'm interested in building French models. I managed to build a small model using this branch and a subset of the dataset. Are there any outstanding issues preventing this from getting merged that I could look into?
