Further support for French #78
base: master
Conversation
When the sequitur model is used, transcripts should be split with a language-specific tokenizer.
Resulting models are available here:
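For illustration only (this is not the code in this PR), a language-specific split could be done with NLTK's pretrained French sentence tokenizer; the file names below are assumptions:

```python
# Hypothetical sketch: split raw French transcripts into sentences with a
# language-specific tokenizer before the sequitur g2p step.
# Assumes NLTK is installed; input/output paths are examples only.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt', quiet=True)  # fetch the pretrained sentence tokenizer once

def split_french_transcripts(in_path, out_path):
    with open(in_path, encoding='utf-8') as f:
        text = f.read()
    # 'french' selects the punkt model trained on French, which handles
    # abbreviations such as "M." better than a plain split on ". "
    sentences = sent_tokenize(text, language='french')
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sentences) + '\n')

split_french_transcripts('transcripts_fr.txt', 'transcripts_fr_split.txt')
```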
Thanks for the great work pguyot! As mentioned in the other ticket, I have been trying to make the French models based on the zamia master branch. (I didn't see your most recent commits until last night :/)

Regarding the hanging import from est_republicain, this is what I meant: on my setup, after running `xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt`, the result is a 1.8 GB file without newlines. I didn't spend much time on it; I used `sed 's/. /.\n/g'` to add some newlines, and that seemed to do the trick for now.

I got around the quality issue by changing the audio scan script to add quality 2 and make ts a lowercase version of the prompt.

The .ipa I was referring to is the French dictionary that comes with Zamia; it is missing pronunciations for these words: hypnoanalgésies, ibogaïne, ibogaïnes, malabéen, malabéenne, malabéennes, malabéens, patagonien, patagonienne, patagoniennes, patagoniens, sulfamidé, sulfamidée, sulfamidées, sulfamidés, théophanique, théophaniques, xavière, xavières. For now I just deleted those lines, as I'm first trying to get it to actually train before doing things more properly. I just saw you committed some sources a week ago that I had not seen yet, and they include a newer dictionary with a different encoding; I will be sure to give that a try.

The next issue I encountered was the import_cv_fr script returning `uttid %s ist not unique!`. I'm not at that PC at the moment, but I believe I was able to patch it somehow to make it work, although my resulting file is not the same as yours.

I did not fix this yet: "create one dir per utt (since we have no speaker information)", as the cv_fr files do seem to contain the speaker information.

Are you generating those files manually, or do you use GenerateCorpora for that, or some other way to split into train / dev / test? I saw you also wrote import scripts for some more corpora; that was going to be my plan for the coming week, so you already saved me quite some time there! I may have some more text and audio corpora to add to the list.
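For what it's worth, a rough Python equivalent of that sed workaround, streaming the 1.8 GB file instead of loading it whole, could look like the sketch below (file names are illustrative, not part of the PR):

```python
# Streaming equivalent of the `sed 's/. /.\n/g'` workaround mentioned above,
# so est_republicain.txt never has to fit in memory at once.
import re

def add_sentence_breaks(in_path, out_path, chunk_size=1 << 20):
    carry = ''
    with open(in_path, encoding='utf-8', errors='replace') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            buf = carry + chunk
            # defer any trailing ". " candidate so a match never spans two chunks
            cut = re.search(r'[. ]*$', buf).start()
            dst.write(re.sub(r'\. +', '.\n', buf[:cut]))
            carry = buf[cut:]
        dst.write(re.sub(r'\. +', '.\n', carry))

add_sentence_breaks('est_republicain.txt', 'est_republicain_split.txt')
```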
Could you please suggest a precise revision of the
Umm. I realize I invoke

I believe the .ipa file in this pull request does not have these errors. Likewise, I fixed the spk_test.txt files in this pull request, which is why the WER increased compared to my previous attempt, yet the error rates on real-world sounds decreased significantly (it's not perfect, but it looks like it is recognizing something).

To generate the spk_test.txt files, I found out that Guenter had used a specific proportion of speakers (5% if I remember properly) and I picked them randomly.

Please be aware of the license of the audio and text corpora. The corpora I added are available under a CC-BY-NC-SA license, which is fine for me but represents a stronger constraint compared to Mozilla CV.
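As a purely illustrative sketch of that selection step (paths and the one-speaker-per-line format are assumptions, not the exact scripts used here), picking a random 5% of speakers for spk_test.txt could be done like this:

```python
# Pick a random fraction of speakers as the held-out test set.
import random

def pick_test_speakers(all_speakers_path, spk_test_path, fraction=0.05, seed=42):
    with open(all_speakers_path, encoding='utf-8') as f:
        speakers = sorted({line.strip() for line in f if line.strip()})
    random.Random(seed).shuffle(speakers)          # fixed seed keeps the split reproducible
    n_test = max(1, int(len(speakers) * fraction))
    with open(spk_test_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(speakers[:n_test])) + '\n')

pick_test_speakers('all_speakers.txt', 'spk_test.txt')
```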
Finally finished the tdnn_f model.
%WER 25.40 [ 35749 / 140755, 3408 ins, 11742 del, 20599 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_7_0.0
Downloadable from the same link.
Hello, I'm bumping this because I'm interested in building French models. I managed to build a small model using this branch and a subset of the dataset. Are there any outstanding issues preventing this from getting merged that I could look into?
Further support for French, which allowed me to build a reasonable model for French (tdnn_250; tdnn_f still being built).
%WER 30.17 [ 42464 / 140755, 3870 ins, 12822 del, 25772 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_7_0.0
The WER is quite high (higher than previously reported); however, this is probably because the previous model's WER was poorly computed (against a small set of test voices) and this new model is trained on a lot of noisy corpora.
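For reference, the Kaldi scoring line above breaks down as (insertions + deletions + substitutions) divided by the number of reference words: (3870 + 12822 + 25772) / 140755 = 42464 / 140755 ≈ 30.17 %.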