Home
In broadcasting, the quality of audio recordings is critical. In some situations, the content or quality of a recording is degraded to the point of being unusable, and since the speaker cannot always re-record the program, it has to be canceled. A speech synthesis system with voice conditioning could overcome this problem by recreating the host's voice from simple text transcriptions. This study focuses on the development of a voice cloning system from a very small but complete dataset covering all the phonemes of the target language. The process includes data preparation and processing, model training, and evaluation. The resulting model proved effective, reaching an average loss of 1.02 after a very short training time. This performance is promising and suggests room for future improvements in speech synthesis on small datasets.
Speech synthesis has evolved considerably in recent years with the advances in artificial intelligence and natural language processing. It is found in many applications, ranging from voice assistants and audiobooks to real-time spoken translation.
Voice conditioning, which makes it possible to imitate specific voices, broadens the possibilities in the communication and entertainment sectors. This technology offers practical solutions when voice recordings are of very poor quality or simply unavailable.
The major challenge in speech synthesis is training high-quality models on limited datasets: these systems normally need large amounts of data to produce natural results, and effectively conditioning on a target voice remains a complex task.
In this context, the objective of my study is to develop a conditioned speech synthesis system that can be trained effectively on a small dataset. Using optimization and language processing techniques, I aim to create a model that can faithfully reproduce a speaker's voice.
Such a system could be useful, for example, in the broadcasting sector, where faithful reproduction of the voice is essential.
Traditional speech synthesis models require large datasets for training. My system is distinguished by the fact that it works with a small dataset.
The synthesis process is carried out in two main steps: the prediction of spectrograms from text, followed by the generation of audio from these spectrograms. Figure 1 shows the architecture of such a system.
Among the state-of-the-art models, Tacotron 2 \cite{TTS} stands out by synthesizing natural-sounding speech from textual transcriptions alone, without any additional prosodic information. From the input text, it produces mel spectrograms using an encoder-decoder architecture. As shown in Figure 1, a vocoder such as WaveNet, HiFi-GAN \cite{hifigan} or WaveGlow \cite{WaveGlow} is then applied to generate speech from these predicted spectrograms.
To condition on a person's voice, my system relies on NVIDIA's models, Tacotron 2 and WaveGlow \cite{NvidiaWaveGlow}, both trained on LJSpeech, a large single-speaker dataset. Unlike the original model, the PyTorch implementation of Tacotron 2 uses dropout instead of zoneout to regularize the LSTM layers \cite{TTSpytorch}.
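As an illustration, the pre-trained models can be loaded through torch.hub as in the minimal sketch below. The entry-point names follow NVIDIA's DeepLearningExamples hub listing; exact arguments may vary between versions.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained Tacotron 2 and WaveGlow published by NVIDIA on torch.hub
# (an internet connection is required the first time the weights are fetched).
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                           'nvidia_tacotron2', model_math='fp32')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')

tacotron2 = tacotron2.to(device).eval()
# Weight normalization is removed from WaveGlow before inference, as in NVIDIA's example.
waveglow = waveglow.remove_weightnorm(waveglow).to(device).eval()
```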
Besides the choice of the pre-trained models, data collection is crucial to make training on a small dataset effective, in contrast to traditional systems.
The work was carried out in several steps in order to produce an efficient speech synthesis system.
\textbf{First step: Data preparation and processing}. To simplify the task, English was chosen. An initial selection of sentences, including phonetic pangrams, was made to cover all the phonemes of English, which is essential for training the model. Data collection gathered 59 voice recordings and their corresponding text transcriptions, for a total duration of only 4 minutes and 6 seconds. A cleaning phase was applied to the recordings to remove background noise and silences and to normalize the audio volume. The texts were converted into phonemes using the CMU pronouncing dictionary and the ARPAbet notation \cite{CMUarpabet}, and the corresponding spectrograms were generated. Each audio file is therefore paired with its text transcription and its spectrogram.
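A minimal sketch of this preparation step is given below, under some assumptions: it uses NLTK's copy of the CMU pronouncing dictionary for the ARPAbet conversion and librosa for silence trimming, volume normalization and mel spectrogram extraction; the file name and parameter values are illustrative, not necessarily the exact ones used in the project.

```python
import librosa
import numpy as np
from nltk.corpus import cmudict  # CMU pronouncing dictionary (ARPAbet symbols)

arpabet = cmudict.dict()  # requires a one-time nltk.download('cmudict')

def text_to_arpabet(text):
    """Convert a transcription to ARPAbet phonemes, word by word.
    Words missing from the dictionary are kept as plain graphemes."""
    phones = []
    for word in text.lower().split():
        word = word.strip(".,!?;:")
        if word in arpabet:
            # take the first listed pronunciation
            phones.append("{" + " ".join(arpabet[word][0]) + "}")
        else:
            phones.append(word)
    return " ".join(phones)

def clean_and_mel(wav_path, sr=22050, n_mels=80):
    """Load a recording, trim leading/trailing silence, peak-normalize the
    volume, and compute the log-mel spectrogram used as a training target."""
    y, _ = librosa.load(wav_path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=30)            # remove silences
    y = y / (np.max(np.abs(y)) + 1e-9)                   # normalize volume
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

# Example: one (transcription, spectrogram) training pair
# phones = text_to_arpabet("The quick brown fox jumps over the lazy dog")
# mel = clean_and_mel("recordings/sample_001.wav")       # hypothetical path
```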
\textbf{Second step: Modeling with Tacotron 2}. Tacotron 2, an advanced speech synthesis model based on an attention mechanism, was used. Training focused on converting text into the corresponding spectrograms. Periodic validations were performed to monitor learning and adjust the hyperparameters. Computing the average loss and converting the generated spectrograms into audio signals with the vocoder were crucial in this task.
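A simplified fine-tuning loop in the spirit of this step is sketched below. The `dataset`, `collate_fn`, `Tacotron2Loss` and `batch_to_device` helpers are assumed to follow NVIDIA's reference Tacotron 2 training code (padded phoneme sequences, mel targets and stop-token targets); the sketch shows the overall flow rather than the exact script used.

```python
import torch
from torch.utils.data import DataLoader

# Illustrative fine-tuning loop; `dataset`, `collate_fn`, `Tacotron2Loss` and
# `batch_to_device` are assumed helpers in the style of NVIDIA's reference code.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = tacotron2.to(device)                     # pre-trained model loaded above
criterion = Tacotron2Loss()                      # mel MSE + stop-token BCE (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)
loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)

model.train()
for epoch in range(20):                          # roughly 20 epochs were enough here
    running_loss = 0.0
    for batch in loader:
        inputs, targets = batch_to_device(batch, device)  # assumed helper
        predictions = model(inputs)              # predicted mels + gate logits
        loss = criterion(predictions, targets)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch}: average training loss = {running_loss / len(loader):.2f}")
```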
\textbf{Third step: Evaluation and optimization}. This last step involved analyzing the quality of the voice synthesized from the input text and its similarity to the target voice. A final test was performed to confirm that the project objectives had been met. This process resulted in a high-quality synthetic voice, with room to further improve its naturalness.
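For the synthesis part of the evaluation, a pass through the fine-tuned Tacotron 2 and the pre-trained WaveGlow vocoder could look like the sketch below; the `nvidia_tts_utils` text-processing helper is the one published alongside NVIDIA's hub models, and the input sentence is only an example.

```python
import torch
from scipy.io.wavfile import write

# NVIDIA's hub models are typically run on a GPU; CPU is used here as a fallback.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-processing utilities published with the models on torch.hub.
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
text = "The quick brown fox jumps over the lazy dog."
sequences, lengths = utils.prepare_input_sequence([text])

tacotron2.eval()
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences.to(device), lengths.to(device))
    audio = waveglow.infer(mel)                  # mel spectrogram -> waveform

write("synthesized.wav", 22050, audio[0].cpu().numpy())  # 22.05 kHz output
```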
The results demonstrated the effectiveness of the training. As can be seen in Figure 2, the model showed a high capacity to retain the essential information in only 20 epochs. It managed to faithfully synthesize the target voice from a restricted dataset of only about 4 minutes of audio recordings. In addition, a very good alignment between the spectrograms and the corresponding texts can be observed, which indicates that the model learned well.
The strategic use of phonetic pangrams, together with the selection of models pre-trained on a single-speaker dataset, played a crucial role in building a robust system: the pangrams covered all the phonemes of the language and thus allowed complete learning, while the models pre-trained on single-speaker data streamlined the training process. This approach gave the speech synthesis system the ability to reproduce a natural and accurate voice in a very short time.
These results point to possible future improvements in speech synthesis. This project contributes to the field by exploring new approaches to improving the fidelity of the synthesized voice to the original voice on restricted data.
The results obtained in this project show the viability of the system as well as its potential practical applications, particularly in the broadcasting sector. Its ability to produce a quality synthesized voice from a very small dataset in a very short time is very promising.
However, some challenges remain, such as improving the naturalness of the synthesized voice, extending the technique to other, less common languages such as French, or integrating the system into real-time applications.
In conclusion, this project has laid the foundations for future research on conditioned speech synthesis with limited datasets.