Multi-lingual training #257

nvadigauvce · 2024-06-23T17:15:58Z

Thanks for the wonderful work, which gives good expressive TTS for English speakers. I am planning an Indian multilingual TTS and have a few questions.

Do we need to change only the data and the PL-BERT model, or are other changes required?
Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
For multiple languages, do we need to add data from each language to OOD_data: "Data/OOD_texts.txt"?
Do we need to add a language ID during data preparation, similar to the speaker ID in train_list.txt/val_list.txt?

SandyPanda-MLDL · 2024-06-23T17:25:42Z

You have to train the PL-BERT model on a text dataset in the language you want. A text dataset of a little over 30 MB is sufficient, though you can use a larger one. Then use that trained PL-BERT model in StyleTTS2. Since you want to work with multilingual data, you of course need a phonemizer and tokenizer that support each language. And you have to train StyleTTS2 (training stage 1 and stage 2) on the dataset for that language (train_list.txt, val_list.txt and OOD_texts.txt).

nvadigauvce · 2024-06-24T03:46:56Z

@SandyPanda-MLDL Thanks for the quick reply and for answering the first question. I understand the part about training the PL-BERT model on a multilingual dataset.

How about the other three questions?
2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
3. For multiple languages, do we need to add data from each language to OOD_data: "Data/OOD_texts.txt"?
4. Do we need to add a language ID during data preparation, similar to the speaker ID in train_list.txt/val_list.txt?

traderpedroso · 2024-06-25T09:45:09Z

According to the README, the ASR model performs well in other languages. I tested it and it does work fine. However, when I trained my own ASR model, StyleTTS improved dramatically. After that, I decided to train all the models on my own data and achieved the same quality that the model delivers in English.

nvadigauvce · 2024-06-25T16:53:33Z

@traderpedroso Thanks for the reply.

Did you fine-tune the ASR model (https://github.com/yl4579/AuxiliaryASR) on top of the existing ASR model, or train it from scratch with multiple languages?
Did you also try to train the PL-BERT model with multiple languages? If yes, can we combine multiple languages, and do we need an equal amount of training data for each language?

traderpedroso · 2024-06-26T04:32:55Z

I used the PL-BERT recommended in the multilingual repository https://huggingface.co/papercup-ai/multilingual-pl-bert and it worked perfectly for ASR. I tested it with fine-tuning and also tried training from scratch; both approaches gave me the same result. Clearly, the ASR that I trained from scratch was for a single language.

From my experience training StyleTTS 2, it's only worthwhile because the inference is very fast and consumes little VRAM, but the training cost makes it somewhat unfeasible. Besides, you can only train the second stage with a single GPU. Clearly, I didn't train the model from scratch, which would be even more expensive, but I can guarantee that the quality is sensational. Another advantage of StyleTTS 2 is that it doesn’t hallucinate; the generated audios are extremely reliable, especially for real-time streaming applications that don’t need monitoring. However, in terms of cost vs. benefit, I personally prefer Tortoise for the final outcome.

nvadigauvce · 2024-06-26T04:49:52Z

@traderpedroso Thanks, I understand the AuxiliaryASR part. I will train it from scratch if the quality is bad.

My use case is multilingual TTS for Indian languages, but Indian languages are not part of the multilingual PL-BERT (https://huggingface.co/papercup-ai/multilingual-pl-bert), so do you think we can still use multilingual-pl-bert for unseen languages?
For multiple languages, do we need to add data from each language to OOD_data: "Data/OOD_texts.txt"?
Do we need to add a language ID during data preparation for the multilingual use case, similar to the speaker ID in train_list.txt/val_list.txt? Otherwise, how will it know which language to select at inference time?

traderpedroso · 2024-06-27T15:38:28Z

Ensure that the speaker IDs are numbers. I personally used large numbers for the IDs, such as 3000, 3001, etc. You need to fine-tune the multilingual-pl-bert with your language if it is not listed. You do not need to add a language ID. Keep the data as in the example in the Data folder.

I added data in the same language I trained within the Data/OOD_texts.txt, but honestly, I believe it has no relevance because in the first 20 epochs I trained with the original Data/OOD_texts.txt, and the model was already generating quality audios.

At inference, you need a dropdown list to select the language for your G2P (in this case, phonemizer), or a library that detects the language and switches the language code passed to phonemizer, for example en-us, it, fr, etc.
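The language switch described above could look roughly like the sketch below. It is only an illustration: the dropdown labels, the `LANGUAGE_CODES` table and the helper names are hypothetical, and the `EspeakBackend` arguments mirror the ones quoted later in this thread. It assumes the phonemizer package and espeak are installed when `phonemize_text` is actually called.

```python
# Hypothetical mapping from UI dropdown labels to espeak language codes.
# en-us, it and fr come from the comment above; pt-br appears later in
# this thread. Extend it with whatever your espeak build supports.
LANGUAGE_CODES = {
    "English (US)": "en-us",
    "Italian": "it",
    "French": "fr",
    "Portuguese (Brazil)": "pt-br",
}

_backends = {}  # cache: one phonemizer backend per language code


def resolve_language(choice, default="en-us"):
    """Return the espeak code for a dropdown label or a raw code."""
    if choice in LANGUAGE_CODES:
        return LANGUAGE_CODES[choice]
    if choice in LANGUAGE_CODES.values():
        return choice
    return default  # unknown selection: fall back to the default code


def phonemize_text(text, choice):
    """Phonemize text with the backend for the selected language."""
    # Imported lazily so the mapping logic works without phonemizer installed.
    from phonemizer.backend import EspeakBackend

    code = resolve_language(choice)
    if code not in _backends:
        _backends[code] = EspeakBackend(
            language=code, preserve_punctuation=True, with_stress=True
        )
    # phonemize() takes a list of strings and returns a list of strings.
    return _backends[code].phonemize([text])[0]
```

A language-detection library could feed its result into `resolve_language` instead of a dropdown; the model itself receives only phonemes, which is why no language ID is needed in the training lists.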

nvadigauvce · 2024-06-28T04:57:42Z

@traderpedroso Thanks for answering all my questions in such detail. I will try to build a multilingual TTS model and will report back if it is successful.

mc-marcocheng · 2024-06-30T08:14:19Z

@traderpedroso How many hours of audio data did you use for training?

traderpedroso · 2024-07-02T08:10:44Z

6 hours of audio at 24000 Hz, batch size 4, on an A100 80GB, running for 10 hours for 10 epochs. The first time I trained with len 300 for 30 epochs and the quality was bad. After that I fine-tuned the same model for 10 epochs with len 800, and after the second epoch it was generating perfect audio.

mc-marcocheng · 2024-07-03T09:58:08Z

That is much less audio data than I expected. For the len that you changed, do you mean the max_len in the config?

traderpedroso · 2024-07-08T13:39:42Z

Yes, max_len of 800, but I found a more efficient way when training my fourth model. I followed this approach: first, I trained the model with audio from 2 seconds to a maximum of 4 seconds; second, I set max_len to 300. Of course, the final quality wasn't interesting, but it trained without problems for 50 epochs in less than 2 hours. Then I did the fine-tuning for 5 epochs with audio of the "same length" of 8 seconds. The model turned out perfect, with zero noise at the end and smooth pronunciation. It became very humanized and much better, and I spent fewer resources on the training. The 8-second audios can be from various speakers, with a maximum of 80 seconds each. In my case, I trained with 50 speakers, and the fine-tuning used only a one-hour dataset with max_len 800.
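The two-pass recipe above amounts to splitting the training list by clip duration. Below is a minimal sketch of that split, not the author's actual tooling: the helper names are hypothetical, the list format `path|text|speaker` is taken from the examples in this thread, and the conversion from `max_len` to seconds assumes `max_len` counts mel frames with a hop size of 300 samples at 24000 Hz (check your own config).

```python
import wave

SAMPLE_RATE = 24000  # from the thread: 24000 Hz audio
HOP_LENGTH = 300     # assumption: mel hop size, check your config


def frames_to_seconds(max_len, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Approximate clip length in seconds covered by max_len mel frames."""
    return max_len * hop_length / sample_rate


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def select_by_duration(list_lines, min_s, max_s, duration_fn=wav_duration_seconds):
    """Keep 'path|text|speaker' lines whose audio lasts min_s..max_s seconds."""
    kept = []
    for line in list_lines:
        path = line.split("|", 1)[0]
        if min_s <= duration_fn(path) <= max_s:
            kept.append(line)
    return kept
```

Under these assumptions, `select_by_duration(lines, 2.0, 4.0)` would give the list for the fast first pass and something like `select_by_duration(lines, 7.5, 8.5)` the list for the short fine-tuning pass on clips of about the same length.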

nvadigauvce · 2024-07-08T15:39:29Z

@traderpedroso Thanks for your insights.

I was able to successfully fine-tune the model with 4 hours of data for a single Indic speaker. But at the end of the audio, I hear some noisy click sounds. Any pointers on how to solve this?
I was able to use max_len=400 for the initial training and max_len=300 for joint training. If I increase max_len, I get OOM. Did you use max_len=800 for joint training as well?

tanishbajaj101 · 2024-07-19T12:19:14Z

hey! were you able to build the PL-BERT model for hindi? i seem to be in the same situation as you.

traderpedroso · 2024-07-23T20:06:22Z

You need to add silence padding to your audio before training. I added 500ms to the beginning and end of each audio file. Then, during inference, I implemented a workaround that trims the generated audio:

import numpy as np
import scipy.io.wavfile

# normalizer, split_sentence, styletts2importable and voices come from the
# surrounding inference setup and are not defined in this snippet.

def trim_audio(audio_np_array, sample_rate=24000, trim_ms=350):
    # Cut trim_ms from both ends to drop the silence learned from the padding.
    trim_samples = int(trim_ms * sample_rate / 1000)
    if len(audio_np_array) > 2 * trim_samples:
        # Explicit end index so that trim_ms=0 returns the full array.
        trimmed_audio_np = audio_np_array[trim_samples:len(audio_np_array) - trim_samples]
    else:
        # Too short to trim safely; return it unchanged.
        trimmed_audio_np = audio_np_array
    return trimmed_audio_np

def tts(input: str, voice="Bia", output_sample_rate=24000, alpha=0.7, beta=0.7, diffusion_steps=5, embedding_scale=2, output_wav_file=None):
    text = normalizer(input)
    if text.strip() == "":
        raise ValueError("insert some text")
    if len(text) > 50000:
        raise ValueError("max 50,000 characters")

    # Synthesize sentence by sentence, trimming each chunk before joining.
    texts = split_sentence(text)
    audios = []
    for t in texts:
        audio = styletts2importable.inference(
            t,
            voices[voice],
            alpha=alpha,
            beta=beta,
            diffusion_steps=diffusion_steps,
            embedding_scale=embedding_scale,
        )
        trimmed_audio = trim_audio(audio)
        audios.append(trimmed_audio)
    output_audio = np.concatenate(audios)
    if output_wav_file:
        scipy.io.wavfile.write(output_wav_file, rate=output_sample_rate, data=output_audio)
    return output_sample_rate, output_audio
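
The training-side half of this workaround, adding 500ms of silence to both ends of every file, is not shown in the comment. A minimal sketch using only the standard library might look like the following; the function name is hypothetical and it assumes uncompressed PCM WAV input with 16-bit or wider samples.

```python
import wave


def pad_wav_with_silence(src_path, dst_path, pad_ms=500):
    """Write a copy of a PCM WAV file with pad_ms of silence at both ends."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    # Zero bytes are digital silence for signed PCM (16-bit and wider).
    # 8-bit WAV is unsigned, so this sketch refuses it.
    if params.sampwidth == 1:
        raise ValueError("8-bit WAV is not supported by this sketch")

    pad_frames = int(params.framerate * pad_ms / 1000)
    silence = b"\x00" * (pad_frames * params.nchannels * params.sampwidth)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The header frame count is corrected automatically on close.
        dst.writeframes(silence + frames + silence)
    return pad_frames
```

Running it over every file in the training and validation lists before training, and then trimming the generated audio at inference as above, matches the order of steps described in the comment.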

nvadigauvce · 2024-07-25T08:46:38Z

@traderpedroso Thanks for detailed answer and code, this is very helpful.

nvadigauvce · 2024-07-25T08:53:54Z

@tanishbajaj101 I have trained a Hindi StyleTTS2 model with the existing English BERT model and it seems to be working fine without any issues, so I have not yet explored a Hindi PL-BERT model.

traderpedroso · 2024-07-30T02:13:05Z

I'm building a dataset creator for WebUI using models to recognize speakers, segment audio, and detect silence for cuts and padding. Using Whisper alone for cutting audio isn't ideal, it's hard to get good quality cuts. Doing it manually is a lot of work! I found some models on Hugging Face that might help, so I'm hoping to develop something that makes fine-tuning easier for everyone. If I get something working well, I'll share it here. Thanks, and see you later!

xujzouyyz · 2024-08-17T18:35:34Z

If I want to train a model for a new language, these are the steps I need to follow:

Train an ASR model for the new language and use it in TTS training;
Train a PL-BERT model for the new language and use it in TTS training;
Prepare audio-text-phoneme data;
Train the TTS model.
Could you please tell me whether there are any other steps I need to do?

martinambrus · 2024-08-20T13:26:46Z

@traderpedroso thank you very much for all your detailed work, shortcuts and workarounds. I'm trying to train a good English model from scratch but since I'm using around 9000 WAV files with lengths from 1s to 20s, it's actually quite costly (although even at 20 epochs of stage 2, I'm already getting some good results).

If you'd like to put your ideas into a single place, I'd be more than willing to compile a Jupyter notebook that would incorporate your findings to help others train their models.

I'm going to try your shortcut approach on training Slovak (my language) model next month and will report how that went.

EDIT: did you try training a model for different language? I'm finding that the training code is not prepared for multilingual PL-BERT (which is what also some other people discovered) and I'm having trouble adjusting the code to cope with it. I commented more details in this PL-BERT multilingual repo fork.

Thanks again!

traderpedroso · 2024-08-22T00:51:09Z

Yes, I trained with Brazilian Portuguese and Italian, and I had a lot of success in the results using PL-BERT. You don't need to change anything in the code, just replace everything in the /Utils/PLBERT/ folder with the multilingual version.

Another thing I noticed is that it's better to have a few audios with perfect cuts than hundreds of audios with cuts that generate noise. So, padding at the beginning and end of the audio is extremely necessary. I was a bit short on time these days, but next week I'll make a Gradio app available to apply the cuts and make it easier. As I mentioned, training each time with audios of the same size generates better results. Fine-tuning the English model for only 5 epochs on 15 minutes of audio, with SLM already active from the start, gave undeniable results, always following the rule of audios of the same size, with a minimum of 4 seconds.

mantrakp04 · 2024-08-23T18:11:05Z

does anyone have an easy to follow jupyter notebook or webui?

martinambrus · 2024-08-23T18:16:38Z

Not for multilingual but for single language I have created 2 notebooks here: #144

The training notebook is easily adaptable to multilingual use by simply replacing the PL-BERT subfolder in the Utils folder with the multilingual one, or one that you trained yourself. For example, I used https://huggingface.co/gerulata/slovakbert for the Slovak language.

mantrakp04 · 2024-08-24T09:34:41Z

For a single-language, multi-speaker dataset of approximately 1k hours, primarily consisting of Hindi audiobook recordings, would you recommend training a model from scratch or fine-tuning?

martinambrus · 2024-08-26T08:49:18Z

@traderpedroso would you be able to write down a couple of points on how to train my own ASR? I tried to clone your https://github.com/traderpedroso/AuxiliaryASR but in the example Jupyter notebook, there is some metadata.txt file that I don't have, so I couldn't progress - and since I'm still fairly new to all this, I'll be very grateful for any pointers here... I already successfully implemented a Slovak PL-BERT and this is the last step for me to perfect my training :)

traderpedroso · 2024-08-28T07:17:51Z

https://github.com/yl4579/AuxiliaryASR

I suggest you use the official one. I made some workarounds to make mine work with phonemizer in Brazilian Portuguese. Normally you can create your train list and validation list already converted to phonemes, as I was doing for testing custom phonemes. I ended up modifying mine a lot, and I believe it won't be useful in your case. Just to be clear, AuxiliaryASR training will improve pronunciation in the language you train it on, and it's not necessary for English.

However, if you want to use mine, simply modify meldataset.py: change global_phonemizer = phonemizer.backend.EspeakBackend(language='pt-br', preserve_punctuation=True, with_stress=True) to global_phonemizer = phonemizer.backend.EspeakBackend(language='your language iso', preserve_punctuation=True, with_stress=True). Remember that your dataset must not contain phonemes; it should be plain text in this format: LJSpeech-1.1/wavs/LJ048-0203.wav|The three officers confirm that their primary concern was crowd and traffic control,|0.
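
The `path|text|speaker` list format shown above can be produced and sanity-checked with a few lines of Python. This is only a sketch with hypothetical helper names; it relies on nothing beyond the format quoted in the comment and the earlier advice that speaker IDs be numbers.

```python
def make_list_line(wav_path, text, speaker_id):
    """Build one 'path|text|speaker_id' line for a train or validation list."""
    if "|" in wav_path or "|" in text or "\n" in text:
        # The pipe is the field separator, so it cannot appear inside a field.
        raise ValueError("fields must not contain '|' or newlines")
    # Earlier in the thread it is suggested that speaker IDs be numbers.
    return f"{wav_path}|{text.strip()}|{int(speaker_id)}"


def parse_list_line(line):
    """Split a list line back into (wav_path, text, speaker_id)."""
    wav_path, text, speaker = line.rstrip("\n").split("|")
    return wav_path, text, int(speaker)
```

Parsing every line back with `parse_list_line` before training is a cheap way to catch rows with a missing field or a non-numeric speaker ID.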

5Hyeons · 2024-09-06T00:49:00Z

Thanks for your detailed explanation! It's helpful for me.
But I have a simple question: when you trained the model with 2 to 4 second audio, did you preprocess the original audio in any way? I mean, did you cut or modify the original data to make the clips?
And in the fine-tuning stage, should I make the data the "same length"?

martinambrus · 2024-09-18T09:53:10Z

Hi :)

I'm trying your approach and have a question about the 2-4 seconds training. I'm doing 2nd stage and have config set like this:

diff_epoch: 20 # style diffusion starting epoch (2nd stage)
joint_epoch: 50 # joint training starting epoch (2nd stage)

What I found is that for about the first 25 epochs, the quality of the model was actually really good. But after that, especially when the joint epoch started to kick in, I saw a large degradation in the output quality (some letters not pronounced at all, etc.).
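
To make the schedule in that config concrete, here is a small sketch of which second-stage phases would be active at a given epoch. It assumes epochs are counted from 0 and that each phase switches on once the epoch counter reaches its configured value; the function and phase labels are hypothetical and should be checked against the training script you are running.

```python
def active_phases(epoch, diff_epoch=20, joint_epoch=50):
    """List the second-stage phases assumed to be active at this epoch."""
    phases = ["base second-stage training"]
    if epoch >= diff_epoch:
        phases.append("style diffusion")   # starts at diff_epoch
    if epoch >= joint_epoch:
        phases.append("joint training")    # starts at joint_epoch
    return phases
```

With the values above, a checkpoint from around epoch 25 has style diffusion active but has not yet seen joint training, which is worth keeping in mind when comparing checkpoints from before and after epoch 50.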

So, my question is - what settings did you use for the 2-4s training, both 1st and 2nd stage? If you remember / can disclose this.

EDIT: I understand that the 2-4s training is not really about quality but I'm asking since I'm getting these inconsistent results

Thanks!

martinambrus · 2024-09-21T19:25:40Z

OK, so I tried this approach with a well-prepared custom English dataset but the results were not very good. While the humanization worked reasonably well, the quality wasn't too good. There were quite a few artifacts when I finetuned the 2-4s model.

I then tried to continue training the 2-4s model but with 7-9s data. The quality improved a lot (much more than when I tried to finetune). However, I was still hearing artifacts in the audio (but I've only trained with 7-9s wavs for like 17 epochs, so maybe it would go away eventually).

traderpedroso · 2024-10-12T02:04:17Z

Sorry for the late response. I used 2 to 4 seconds only to speed up the training, as I couldn't train for a long time without wasting a lot of resources. Therefore, when you train with 2 to 4 seconds for 50 epochs, after that you will use the model as a base and retrain it for 10 epochs. The noise, undoubtedly, was caused by the sudden cuts in some of your audio. It's better to have a smaller amount of quality data than a large amount of data with sudden cuts. You need to ensure that you have silence after the pronunciation of the last word.

I'll give you an example: it's not just about adding silence after the cut, but ensuring that you have at least 100ms of silence in your audio after the pronunciation of the last word. Then you add padding, as I previously mentioned. I used 500ms. Then, during inference, you need to make a post-cut in the audio to remove the amount of silence. In my training, my output became perfect, just like the original.
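
The 100ms rule above can be checked automatically before padding. The sketch below measures how much near-silence a clip ends with; the helper name and the amplitude threshold of 500 are assumptions to tune for your recordings, and it only handles 16-bit mono PCM WAV.

```python
import array
import sys
import wave


def trailing_silence_ms(path, threshold=500):
    """Milliseconds at the end of a 16-bit mono PCM WAV below the threshold."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch expects 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))

    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()

    quiet = 0
    for sample in reversed(samples):
        if abs(sample) > threshold:
            break  # reached the end of the last spoken sound
        quiet += 1
    return 1000.0 * quiet / rate
```

Clips where this returns less than 100 would be candidates for re-cutting or for dropping from the dataset, in line with the advice to prefer a smaller amount of cleanly cut data.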

SandyPanda-MLDL · 2024-10-12T04:55:53Z

200 hrs of audio data

martinambrus · 2024-10-15T10:52:46Z

@traderpedroso How many hours of audio data did you use for training?

6 hours audio 24000hz batch size 4 on A100 80GB for 10 hours running for 10 epochs the first time i trained with len 300 30 epochs was bad quality after that i did a finetunning the same model for 10 epochs with 800 len after se second epoch was generating perfect audio

That is much less audio data than I expected. For the len that you changed, do you mean the max_len in the config?

Yes, max_len of 800, but I found a more efficient way to train the fourth model that I trained. Now I followed this approach: first, I trained the model with audio from 2 seconds to a maximum of 4 seconds. Second, max_len of 300. Of course, the final quality wasn't interesting, but it was perfectly trained for 50 epochs in less than 2 hours. Then I did the finetuning for 5 epochs with audio of the "same length" of 8 seconds. The model turned out perfect, with zero noise in the end, and a smooth pronunciation. It became very humanized and much better, and I spent fewer resources on the training. The 8-second audios can be from various speakers with a maximum of 80 seconds each. In my case, I trained with 50 speakers and the fine-tuning only one hour dataset with max_len 800.

OK, so I tried this approach with a well-prepared custom English dataset but the results were not very good. While the humanization worked reasonably well, the quality wasn't too good. There were quite a few artifacts when I finetuned the 2-4s model.
I then tried to continue training the 2-4s model but with 7-9s data. The quality improved a lot (much more than when I tried to finetune). However, I was still hearing artifacts in the audio (but I've only trained with 7-9s wavs for like 17 epochs, so maybe it would go away eventually).

Sorry for the late response. I used 2 to 4 seconds only to speed up the training, as I couldn't train for a long time without wasting a lot of resources. Therefore, when you train with 2 to 4 seconds for 50 epochs, after that you will use the model as a base and retrain it for 10 epochs. The noise, undoubtedly, was caused by the sudden cuts in some of your audio. It's better to have a smaller amount of quality data than a large amount of data with sudden cuts. You need to ensure that you have silence after the pronunciation of the last word.

I'll give you an example: it's not just about adding silence after the cut, but ensuring that you have at least 100ms of silence in your audio after the pronunciation of the last word. Then you add padding, as I previously mentioned. I used 500ms. Then, during inference, you need to make a post-cut in the audio to remove the amount of silence. In my training, my output became perfect, just like the original.
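The silence handling described above could be sketched roughly as follows. This is only an illustration in plain Python on a list of float samples, not code from StyleTTS2: the function names and the 0.01 amplitude threshold are assumptions, and a real pipeline would more likely use an audio library. It checks how much near-silence already follows the last sound, appends 500 ms of padding for training, and cuts the same amount off the end of generated audio.

```python
def trailing_silence_ms(samples, sample_rate, threshold=0.01):
    """Return the length in ms of the run of near-silent samples at the end.

    threshold is an assumed amplitude cutoff for float samples in [-1, 1].
    """
    count = 0
    for value in reversed(samples):
        if abs(value) > threshold:
            break
        count += 1
    return 1000.0 * count / sample_rate


def pad_end(samples, sample_rate, pad_ms=500):
    """Append pad_ms of digital silence (zeros) to a training clip."""
    return list(samples) + [0.0] * int(sample_rate * pad_ms / 1000)


def cut_end(samples, sample_rate, cut_ms=500):
    """Remove cut_ms from the end of generated audio (the post-cut at inference)."""
    keep = max(0, len(samples) - int(sample_rate * cut_ms / 1000))
    return list(samples[:keep])


sr = 24000
# Synthetic clip: 1 s of signal followed by 100 ms of natural silence.
clip = [0.5] * sr + [0.0] * (sr // 10)

print(trailing_silence_ms(clip, sr))  # 100.0
padded = pad_end(clip, sr)            # clip plus 500 ms of zeros
restored = cut_end(padded, sr)        # same samples as the original clip
```

A clip where `trailing_silence_ms` reports less than 100 would be a candidate for re-cutting or dropping before any padding is added, since padding alone does not fix a word that was cut off abruptly.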

Oh, I didn't mean that I got artifacts towards the end of the audio, as people who finetuned StyleTTS2 without the 100ms silence at the end experienced. I mean that when I finetuned, there were garbled letters and some wild variations in the output throughout the full inference. Only when I continued the training from the 100th epoch of the 2-4s model and used 7-9s wavs for that did these start to fade away. But thanks for the insight :)


martinambrus mentioned this issue on Sep 1, 2024: weird pulse at the end of the model #216 (Open)

martinambrus mentioned this issue on Sep 21, 2024: (Q) Multi/Single Speaker different language finetune #282 (Open)
