
# DeepForcedAligner for Text-to-Speech


🚧 Under Construction! This repo is not expected to work fully. Please check back later for a stable release. 🚧

A fork of the DeepForcedAligner project, implemented in PyTorch Lightning.

From the original authors:

With this tool you can create accurate text-audio alignments given a set of audio files and their transcriptions. The alignments can, for example, be used to train text-to-speech models such as FastSpeech. In comparison to other forced alignment tools, this repo has the following advantages:

- **Multilingual:** By design, the DFA is language-agnostic and can align either characters or phonemes.
- **Robustness:** The alignment extraction is highly tolerant of text errors and silent characters.
- **Convenience:** Easy installation with no extra dependencies. You can provide your own data in the standard LJSpeech format without special preprocessing (such as applying phonetic dictionaries, non-speech annotations, etc.).

The approach is based on training a simple speech recognition model with CTC loss on mel spectrograms extracted from the wav files.

This repo has been separated out in case you would like to use it independently of the broader SGILE system, but if you are looking to build speech synthesis systems from scratch, please visit the main EveryVoice repository.

## Table of Contents

- [Background](#background)
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [Acknowledgements](#acknowledgements)


## Background

There are approximately 70 Indigenous languages spoken in Canada from 10 distinct language families. As a consequence of the residential school system and other policies of cultural suppression, the majority of these languages now have fewer than 500 fluent speakers remaining, most of them elderly.

Despite this, Indigenous people have resisted colonial policies and continued speaking their languages, with interest by students and parents in Indigenous language education continuing to grow. Teachers are often overwhelmed by the number of students, and the trend towards online education means many students who have not previously had access to language classes now do. Supporting these growing cohorts of students comes with unique challenges in languages with few fluent first-language speakers. Teachers are particularly concerned with providing their students with opportunities to hear the language outside of class.

While there is no replacement for a speaker of an Indigenous language, there are possible applications for speech synthesis (text-to-speech) to supplement existing text-based tools like verb conjugators, dictionaries and phrasebooks.

The National Research Council has partnered with the Onkwawenna Kentyohkwa Kanyen’kéha immersion school, W̱SÁNEĆ School Board, University nuhelot’įne thaiyots’į nistameyimâkanak Blue Quills, and the University of Edinburgh to research and develop state-of-the-art speech synthesis (text-to-speech) systems and techniques for Indigenous languages in Canada, with a focus on how to integrate text-to-speech technology into the classroom.

## Installation

Clone the repo and pip install it locally:

```sh
$ git clone https://github.com/EveryVoiceTTS/DeepForcedAligner.git
$ cd DeepForcedAligner
$ pip install -e .
```

## Usage

### Configuration

It is recommended to install everyvoice and run `everyvoice new-project` to create your configuration. After running the command, a `config` folder will be created with an `everyvoice-aligner.yaml` configuration file.
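
For example, assuming `everyvoice` is already installed and on your path:

```sh
$ everyvoice new-project
# creates a config/ folder containing everyvoice-aligner.yaml
```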

### Preprocessing

Preprocess by running `dfaligner preprocess config/everyvoice-aligner.yaml` to generate the Mel spectrograms and the audio and text representations required by the model, using the base configuration.

To run only a subset of the steps, use the `-s` flag; for example, `dfaligner preprocess config/everyvoice-aligner.yaml -s text` runs only the text preprocessing step. Both invocations are shown below:
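
```sh
# full preprocessing with the base configuration
$ dfaligner preprocess config/everyvoice-aligner.yaml

# run only the text preprocessing step
$ dfaligner preprocess config/everyvoice-aligner.yaml -s text
```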

### Training

Train by running `dfaligner train config/everyvoice-aligner.yaml` to use the base configuration.
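
For example:

```sh
$ dfaligner train config/everyvoice-aligner.yaml
```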

You can pass updates to the configuration through the command line like so:

```sh
$ dfaligner train config/everyvoice-aligner.yaml -c preprocessing.save_dir=/my/new/path -c training.batch_size=16
```

### Alignment Extraction

To extract alignments from the model, run `dfaligner extract-alignments config/everyvoice-aligner.yaml --model path/to/model.ckpt`.
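
For example:

```sh
$ dfaligner extract-alignments config/everyvoice-aligner.yaml --model path/to/model.ckpt
```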

To visualize some of your alignments and inspect them with Praat, run `dfaligner extract-alignments config/everyvoice-aligner.yaml --model path/to/model.ckpt --tg utt_count`, where `utt_count` is the number of sample alignments you want to generate TextGrids for. To output TextGrids for your entire dataset, set `utt_count` to the number of utterances in your dataset.
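
A hypothetical example that generates TextGrids for 10 sample utterances (the value 10 is illustrative only):

```sh
$ dfaligner extract-alignments config/everyvoice-aligner.yaml --model path/to/model.ckpt --tg 10
```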

The extracted alignments can be found in the `save_dir` specified in `config/everyvoice-shared-data.yaml`.

## Contributing

Feel free to dive in!

- Open an issue in the main EveryVoice repo with the tag `[DeepForcedAligner]`, or
- submit PRs to this repo with a corresponding submodule update PR to EveryVoice.

This repo follows the Contributor Covenant Code of Conduct.

You can install our standard Git hooks by running these commands in your sandbox:

```sh
pip install -r requirements.dev.txt
pre-commit install
gitlint install-hook
```

Have a look at Contributing.md for the full details on the Conventional Commit messages we prefer, our code formatting conventions, and our Git hooks.

You can then interactively install the package by running the following command from the project root:

```sh
pip install -e .
```

## Acknowledgements

This project is only possible because of the work of the authors of DeepForcedAligner (Christian Schäfer and Francisco Cardinale). Please cite and star their work.