Multi-lingual training #257
You have to train the PL-BERT model on a text dataset in the language you want. A text dataset of about 30 MB or more is sufficient, though you can use a larger one. Then use that trained PL-BERT model in StyleTTS2. Since you want to work with multilingual data, you also need a phonemizer and tokenizer that support each language. Finally, you have to train StyleTTS2 (training stage 1 and stage 2) on the dataset for that language (train.txt, validate.txt and the OOD text file). |
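Before a long training run, it can help to sanity-check the list files. The sketch below assumes the pipe-separated `wav_path|phonemized_text|speaker_id` layout of the example lists in the repository's Data folder; the function name `check_list_lines` and its messages are hypothetical and not part of StyleTTS2.

```python
from typing import List


def check_list_lines(lines: List[str]) -> List[str]:
    """Return human-readable problems found in a training list.

    Assumed layout per line (an assumption of this sketch):
        wav_path|phonemized_text|speaker_id
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) != 3:
            problems.append(f"line {number}: expected 3 fields, got {len(fields)}")
            continue
        wav_path, text, speaker = fields
        if not wav_path.strip():
            problems.append(f"line {number}: empty wav path")
        if not text.strip():
            problems.append(f"line {number}: empty text")
        if not speaker.strip().isdigit():
            problems.append(f"line {number}: speaker ID {speaker!r} is not a number")
    return problems
```

Running it over `open("train.txt").readlines()` and printing the returned list is usually enough; an empty list means no problems were detected under the assumed layout.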
@SandyPanda-MLDL Thanks for the quick reply and for answering the first question. I understand the part about training the PL-BERT model on a multilingual dataset. How about the other three questions? |
According to the documentation in the readme, the ASR model performs well in other languages. I tested it and indeed it works fine. However, when I trained my own ASR model, StyleTTS improved dramatically. After this, I decided to train all models with my own data and achieved results matching the quality that the model delivers in English. |
@traderpedroso Thanks for reply.
|
I used the PL-BERT recommended in the multilingual repository https://huggingface.co/papercup-ai/multilingual-pl-bert and it worked perfectly for ASR. I tested it with fine-tuning and also tried training from scratch; both approaches gave me the same result. Of course, the ASR that I trained from scratch was for a single language. From my experience training StyleTTS 2, it's only worthwhile because the inference is very fast and consumes little VRAM, but the training cost makes it somewhat unfeasible. Besides, you can only train the second stage with a single GPU. To be clear, I didn't train the model from scratch, which would be even more expensive, but I can guarantee that the quality is sensational. Another advantage of StyleTTS 2 is that it doesn't hallucinate; the generated audio is extremely reliable, especially for real-time streaming applications that don't need monitoring. However, in terms of cost vs. benefit, I personally prefer Tortoise for the final outcome. |
@traderpedroso thanks, I understood the Auxiliary ASR part. I will train it from scratch if the quality is bad.
|
Ensure that the speaker IDs are numbers. I personally used large numbers for the IDs, such as 3000, 3001, etc. You need to fine-tune the multilingual-pl-bert with your language if it is not listed. You do not need to add a language ID. Keep the data in the same format as the example in the Data folder. I added text in the language I was training to Data/OOD_texts.txt, but honestly, I believe it has no relevance, because for the first 20 epochs I trained with the original Data/OOD_texts.txt and the model was already generating quality audio. At inference, you need a dropdown list to select the language for your G2P (in this case, phonemizer), or a library that detects the language and switches the language in the phonemizer, for example en-us, it, fr, etc. |
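The language switch at inference can be sketched as a small lookup in front of the G2P call. This is only an illustration: the mapping keys, the `pt-br` entry, and the names `resolve_language` and `phonemize_for` are assumptions, and the G2P function is passed in as a callable so the sketch does not depend on any particular phonemizer setup.

```python
from typing import Callable, Dict

# Illustrative mapping from a UI language choice to an espeak-style code
# such as "en-us", "it" or "fr". Keys and the "pt-br" entry are assumptions.
LANGUAGE_CODES: Dict[str, str] = {
    "english": "en-us",
    "italian": "it",
    "french": "fr",
    "portuguese": "pt-br",
}


def resolve_language(choice: str, default: str = "en-us") -> str:
    """Map a dropdown value (name or code) to a phonemizer language code."""
    key = choice.strip().lower()
    if key in LANGUAGE_CODES.values():
        return key  # already a known code
    return LANGUAGE_CODES.get(key, default)


def phonemize_for(text: str, choice: str, g2p: Callable[[str, str], str]) -> str:
    """Run the injected G2P callable with the resolved language code."""
    return g2p(text, resolve_language(choice))
```

In a real application, `g2p` could wrap the phonemizer package, for example `lambda text, lang: phonemize(text, language=lang, backend="espeak")`, provided espeak is installed and supports the chosen code.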
@traderpedroso thanks for answering all my questions in such detail. I will try to build a multilingual TTS model and will report back if it is successful. |
@traderpedroso How many hours of audio data did you use for training? |
6 hours of audio at 24,000 Hz, batch size 4, on an A100 80GB, running for 10 hours for 10 epochs. The first time, I trained with len 300 for 30 epochs and the quality was bad. After that, I fine-tuned the same model for 10 epochs with len 800, and after the second epoch it was generating perfect audio. |
That is much less audio data than I expected. For the len that you changed, do you mean the max_len in the config? |
Yes, max_len of 800, but I found a more efficient way to train the fourth model that I trained. This time I followed this approach: first, I trained the model with audio from 2 seconds to a maximum of 4 seconds, with max_len of 300. Of course, the final quality wasn't interesting, but it was perfectly trained for 50 epochs in less than 2 hours. Then I did the fine-tuning for 5 epochs with audio of the "same length" of 8 seconds. The model turned out perfect, with zero noise in the end, and a smooth pronunciation. It became very humanized and much better, and I spent fewer resources on the training. The 8-second audios can be from various speakers with a maximum of 80 seconds each. In my case, I trained with 50 speakers, and the fine-tuning used only a one-hour dataset with max_len 800. |
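To relate the max_len values above to clip durations, here is a small conversion sketch. It assumes max_len counts mel frames with a hop size of 300 samples at 24 kHz (about 12.5 ms per frame); if your config uses a different hop size or sample rate, pass those instead. The function names are illustrative.

```python
import math


def max_len_to_seconds(max_len: int, hop_size: int = 300, sample_rate: int = 24000) -> float:
    """Duration covered by max_len frames (hop size and rate are assumptions)."""
    return max_len * hop_size / sample_rate


def seconds_to_max_len(seconds: float, hop_size: int = 300, sample_rate: int = 24000) -> int:
    """Smallest max_len (in frames) that covers a clip of the given duration."""
    return math.ceil(seconds * sample_rate / hop_size)


if __name__ == "__main__":
    for frames in (300, 800):
        print(frames, "frames ->", max_len_to_seconds(frames), "s")
```

Under these assumptions, max_len 300 covers 3.75 s, which fits the 2 to 4 second clips, and max_len 800 covers 10 s, which leaves headroom for 8-second clips.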
@traderpedroso Thanks for your insights.
|
hey! were you able to build the PL-BERT model for hindi? i seem to be in the same situation as you. |
You need to add silence padding to your audio before training. I added 500ms to the beginning and end of each audio file. Then, during inference, I implemented a workaround that trims the generated audio:

    import numpy as np
    import scipy.io.wavfile

    # normalizer, split_sentence, voices and styletts2importable come from
    # the surrounding application and are not shown here.

    def trim_audio(audio_np_array, sample_rate=24000, trim_ms=350):
        # Cut trim_ms from both ends, unless the clip is too short to trim.
        trim_samples = int(trim_ms * sample_rate / 1000)
        if len(audio_np_array) > 2 * trim_samples:
            # Explicit end index so that trim_ms=0 does not return an empty array.
            end = len(audio_np_array) - trim_samples
            trimmed_audio_np = audio_np_array[trim_samples:end]
        else:
            trimmed_audio_np = audio_np_array
        return trimmed_audio_np

    def tts(input: str, voice="Bia", output_sample_rate=24000, alpha=0.7, beta=0.7,
            diffusion_steps=5, embedding_scale=2, output_wav_file=None):
        text = normalizer(input)
        if text.strip() == "":
            raise ValueError("insert some text")
        if len(text) > 50000:
            raise ValueError("max 50.000 tokens")
        # Synthesize sentence by sentence, trimming the padding from each chunk.
        texts = split_sentence(text)
        audios = []
        for t in texts:
            audio = styletts2importable.inference(
                t,
                voices[voice],
                alpha=alpha,
                beta=beta,
                diffusion_steps=diffusion_steps,
                embedding_scale=embedding_scale,
            )
            trimmed_audio = trim_audio(audio)
            audios.append(trimmed_audio)
        output_audio = np.concatenate(audios)
        if output_wav_file:
            scipy.io.wavfile.write(output_wav_file, rate=output_sample_rate, data=output_audio)
        return output_sample_rate, output_audio
|
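The training-side half of this workaround, adding 500ms of silence to both ends of each file, could look roughly like the sketch below. It uses only the standard-library `wave` module and assumes uncompressed PCM WAV files with samples wider than 8 bits, where zero bytes encode silence. The function name `pad_wav_with_silence` is hypothetical; this is not the code used in the thread.

```python
import wave


def pad_wav_with_silence(src_path: str, dst_path: str, pad_ms: int = 500) -> int:
    """Write a copy of src_path with pad_ms of silence at both ends.

    Returns the number of silent frames added per side. Assumes PCM audio
    wider than 8 bits, where zero bytes are silence.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    pad_frames = int(params.framerate * pad_ms / 1000)
    silence = b"\x00" * (pad_frames * params.nchannels * params.sampwidth)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed up on close
        dst.writeframes(silence + frames + silence)
    return pad_frames
```

With 500ms of padding per side in the training data, trimming 350ms per side at inference, as in the code above, still leaves a short margin of silence around each generated chunk.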
@traderpedroso Thanks for detailed answer and code, this is very helpful. |
@tanishbajaj101 I have trained a Hindi StyleTTS2 model with the existing English BERT model, and it seems to be working fine without any issue. So I have not yet explored a Hindi PL-BERT model. |
I'm building a dataset creator for WebUI using models to recognize speakers, segment audio, and detect silence for cuts and padding. Using Whisper alone for cutting audio isn't ideal, it's hard to get good quality cuts. Doing it manually is a lot of work! I found some models on Hugging Face that might help, so I'm hoping to develop something that makes fine-tuning easier for everyone. If I get something working well, I'll share it here. Thanks, and see you later! |
If I want to train a new language model. There are a few steps I need to follow:
|
@traderpedroso thank you very much for all your detailed work, shortcuts and workarounds. I'm trying to train a good English model from scratch, but since I'm using around 9000 WAV files with lengths from 1s to 20s, it's actually quite costly (although even at 20 epochs of stage 2, I'm already getting some good results). If you'd like to put your ideas into a single place, I'd be more than willing to compile a Jupyter notebook that would incorporate your findings to help others train their models. I'm going to try your shortcut approach on training a Slovak (my language) model next month and will report how that went. EDIT: did you try training a model for a different language? I'm finding that the training code is not prepared for multilingual PL-BERT (which some other people have also discovered) and I'm having trouble adjusting the code to cope with it. I commented more details in this PL-BERT multilingual repo fork. Thanks again! |
Yes, I trained with Brazilian Portuguese and Italian, and I had a lot of success in the results using PL-BERT. You don't need to change anything in the code, just replace everything in the /Utils/PLBERT/ folder with the multilingual version. Another thing I noticed is that it's better to have a few audios with perfect cuts than hundreds of audios with cuts that generate noise. So, padding at the beginning and end of the audio is extremely necessary. I was a bit short on time these days, but next week I'll make a Gradio app available to apply the cuts and make it easier. As I mentioned, training each time with audios of the same size generates better results. Fine-tuning the English model for only 5 epochs on 15 minutes of audio, with SLM already active from the start, gave undeniable results, always following the rule of same-size audios with a minimum of 4 seconds. |
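The "audios of the same size" rule amounts to filtering the dataset into a narrow duration window before each training pass. A minimal sketch, assuming the clip durations are already known; the `Clip` tuple and `select_by_duration` are names made up for this example:

```python
from typing import Iterable, List, Tuple

Clip = Tuple[str, float]  # (wav path, duration in seconds)


def select_by_duration(clips: Iterable[Clip], min_s: float, max_s: float) -> List[Clip]:
    """Keep only clips whose duration lies in [min_s, max_s]."""
    return [clip for clip in clips if min_s <= clip[1] <= max_s]
```

For example, `select_by_duration(clips, 2.0, 4.0)` would give the short clips for a fast first pass, and a narrow window around 8 seconds would give the set for the later fine-tune.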
does anyone have an easy to follow jupyter notebook or webui? |
Not for multilingual, but for a single language I have created 2 notebooks here: #144. The training notebook is easily adaptable to multilingual training by simply replacing the PL-BERT subfolder in the Utils folder with the multilingual one, or with one that you trained yourself. For example, I used https://huggingface.co/gerulata/slovakbert for the Slovak language. |
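The folder replacement described above can be done by hand, but a small script makes it repeatable and keeps a backup. This is a sketch under the assumption that the PL-BERT files live in `Utils/PLBERT` under the repository root, as mentioned earlier in the thread; the function name `swap_plbert` and the `PLBERT.bak` backup name are hypothetical.

```python
import shutil
from pathlib import Path


def swap_plbert(repo_root: str, new_plbert_dir: str) -> Path:
    """Replace <repo_root>/Utils/PLBERT with new_plbert_dir, keeping a backup.

    Returns the backup path (Utils/PLBERT.bak), which only exists if there
    was a previous PLBERT folder to move aside.
    """
    target = Path(repo_root) / "Utils" / "PLBERT"
    backup = target.with_name("PLBERT.bak")
    if target.exists():
        if backup.exists():
            shutil.rmtree(backup)  # drop an older backup first
        target.rename(backup)
    shutil.copytree(new_plbert_dir, target)
    return backup
```

Keeping the original folder as a backup makes it easy to switch back to the English PL-BERT for comparison runs.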
For a single-language, multi-speaker dataset of approximately 1k hours, primarily consisting of Hindi audiobook recordings, would you recommend training a model from scratch or fine-tuning? |
@traderpedroso would you be able to write down a couple of points on how to train my own ASR? I tried to clone your https://github.com/traderpedroso/AuxiliaryASR but in the example Jupyter notebook, there is some metadata.txt file that I don't have, so I couldn't progress - and since I'm still fairly new to all this, I'll be very grateful for any pointers here... I already successfully implemented a Slovak PL-BERT and this is the last step for me to perfect my training :) |
I suggest you use the official one. I made some workarounds to make mine work with phonemizer in Brazilian Portuguese. You can usually create your train list and validation list already converted to phonemes, as I was doing for testing custom phonemes. I ended up modifying a lot of mine, and I believe it won't be useful in your case. Just to be clear, AuxiliaryASR training will improve pronunciation in the language you train it on, and it's not necessary for English. However, if you want to use mine, simply modify meldataset.py where you have |
Thanks for your detailed explanation! It's helpful for me. |
Hi :) I'm trying your approach and have a question about the 2-4 seconds training. I'm doing 2nd stage and have config set like this:
What I found is that for about the first 25 epochs, the quality of the model was actually really good. But after that, especially when the joint epoch started to kick in, I saw a large degradation in the output quality (some letters not pronounced at all, etc.). So, my question is: what settings did you use for the 2-4s training, both 1st and 2nd stage? If you remember / can disclose this. EDIT: I understand that the 2-4s training is not really about quality, but I'm asking since I'm getting these inconsistent results. Thanks! |
OK, so I tried this approach with a well-prepared custom English dataset but the results were not very good. While the humanization worked reasonably well, the quality wasn't too good. There were quite a few artifacts when I finetuned the 2-4s model. I then tried to continue training the 2-4s model but with 7-9s data. The quality improved a lot (much more than when I tried to finetune). However, I was still hearing artifacts in the audio (but I've only trained with 7-9s wavs for like 17 epochs, so maybe it would go away eventually). |
Sorry for the late response. I used 2 to 4 seconds only to speed up the training, as I couldn't train for a long time without wasting a lot of resources. Therefore, when you train with 2 to 4 seconds for 50 epochs, after that you will use the model as a base and retrain it for 10 epochs. The noise, undoubtedly, was caused by the sudden cuts in some of your audio. It's better to have a smaller amount of quality data than a large amount of data with sudden cuts. You need to ensure that you have silence after the pronunciation of the last word. I'll give you an example: it's not just about adding silence after the cut, but ensuring that you have at least 100ms of silence in your audio after the pronunciation of the last word. Then you add padding, as I previously mentioned. I used 500ms. Then, during inference, you need to make a post-cut in the audio to remove the amount of silence. In my training, my output became perfect, just like the original. |
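One way to check the "at least 100ms of silence after the last word" condition before adding padding is to measure the quiet tail of each clip. The sketch below works on any sequence of samples scaled to [-1, 1]; the amplitude threshold of 0.01 and both function names are assumptions chosen for illustration, and real recordings may need a different threshold.

```python
from typing import Sequence


def trailing_silence_ms(samples: Sequence[float], sample_rate: int = 24000,
                        threshold: float = 0.01) -> float:
    """Length of the quiet tail in milliseconds.

    A sample counts as silent when its absolute value is below `threshold`
    (an illustrative cut-off for audio scaled to [-1, 1]).
    """
    silent = 0
    for value in reversed(samples):
        if abs(value) >= threshold:
            break
        silent += 1
    return 1000.0 * silent / sample_rate


def has_enough_tail(samples: Sequence[float], sample_rate: int = 24000,
                    min_ms: float = 100.0, threshold: float = 0.01) -> bool:
    """True if the clip ends with at least min_ms of silence."""
    return trailing_silence_ms(samples, sample_rate, threshold) >= min_ms
```

Clips that fail this check are candidates for re-cutting rather than padding, since padding alone does not fix a cut that lands in the middle of a word.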
200 hrs of audio data
|
Oh, I didn't mean that I got artifacts towards the end of the audio, as experienced by people who fine-tuned StyleTTS2 without the 100ms of silence at the end. I mean that when I fine-tuned, there were garbled letters and some wild variations in the output throughout the full inference. Only when I continued the training from the 100th epoch of the 2-4s model and used 7-9s wavs for that did these start to fade away. But thanks for the insight :) |
Thanks for the wonderful work, which gives good expressive TTS for English speakers. I am planning an Indian multilingual TTS. For this purpose, I have a few questions.