Here are some toy examples you can finish in minutes. But before starting on larger datasets like LibriSpeech, it is recommended to read through the Training pipeline to understand what is done at each stage.
Prepare yesno data:

- The yesno data is only for debugging.
- It's hard to train a recognizer with CTC / RNN-T on such a small corpus, so you will see higher error rates in the results (compared to 0% in Kaldi).

```bash
bash local/data.sh
```
- Train a Transducer (RNN-T):

  ```bash
  python utils/pipeline/asr.py exp/asr-rnnt --ngpu 1
  ```
- RNN-T + CUSIDE:

  Note that a critical argument for the CUSIDE setting is the `downsampling_ratio` in the configuration. There is no automatic way for the program to recognize the downsampling ratio, so you need to configure it according to your encoder setting (see the sketch after this list for one way to derive it).

  ```bash
  python utils/pipeline/asr.py exp/asr-rnnt-cuside --ngpu 1
  ```
- Train CTC:

  ```bash
  python utils/pipeline/asr.py exp/asr-ctc --ngpu 1
  ```
- Train CTC-CRF. CTC-CRF is somewhat more complex than vanilla CTC; please refer to `run.sh` for usage:

  ```bash
  bash exp/asr-ctc-crf/run.sh
  ```
- For join-AP examples, please refer to this doc.
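The CUSIDE `downsampling_ratio` mentioned above should match how much your encoder shrinks the time axis. Here is a minimal sketch of the reasoning, assuming the ratio equals the total temporal subsampling of the encoder front end (the layer strides below are hypothetical, not taken from any template):

```python
# Sketch: derive downsampling_ratio from the encoder's subsampling layers.
# Assumption: the ratio equals the product of the temporal strides of the
# front-end (e.g. a conv subsampling block with two stride-2 convolutions).
from math import prod

temporal_strides = [2, 2]  # hypothetical strides taken from your encoder config

downsampling_ratio = prod(temporal_strides)
print(downsampling_ratio)  # -> 4, the value to set in the CUSIDE configuration
```

If your encoder keeps the input frame rate (no subsampling), the ratio would simply be 1.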
Prepare text data (a small subset from LibriSpeech):

```bash
bash local/lm_data.sh
```
- Train a Transformer LM:

  ```bash
  python utils/pipeline/lm.py exp/lm-nn --ngpu 1
  ```
- Train a 3-gram word LM:

  ```bash
  bash utils/pipeline/ngram.sh exp/lm-ngram-word -o 3
  ```
Both the ASR (RNN-T/CTC) and LM pipelines include 4 stages:

(data preparation) -> tokenizer training -> data packing -> nn training -> inference

For the ASR pipeline, inference is the decoding stage; for the LM pipeline, it is the perplexity computation stage. The pipeline for n-gram training is somewhat special: there is no NN model and all operations are conducted on CPUs, so the "nn training" stage of the n-gram pipeline is replaced by training with KenLM's `lmplz` (the n-gram training tool).
Note that data preparation is not included in the standard pipelines. In my design, it is your duty to prepare the data in the proper format. I recommend implementing this part in the file `local/data.sh` (in fact, I've already done that for several recipes), so others can easily reproduce your experiments. Here is an illustration of the RNN-T pipeline:
What is done at each stage
- Speech data preparation. As I mentioned, data preparation is your job; I do not take care of that. To begin preparing a dataset yourself, please have a look at the yesno recipe for a toy example of preparing data with `torchaudio`. For a more meaningful example, please refer to aishell, which prepares data with `torchaudio` or `kaldi` (if you have Kaldi installed). If you already have data in the standard Kaldi format (`.scp` for features and a text transcription file), you can add the paths to `egs/[task]/data/metainfo.json` as follows:

  ```json
  {
      "libri-960": {
          "scp": "/path/to/libri-960.scp",
          "trans": "/path/to/libri-960.txt"
      },
      "wsj": {
          "scp": "/path/to/wsj.scp",
          "trans": "/path/to/wsj.txt"
      }
  }
  ```
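  As a quick sanity check before running the pipeline, you may want to verify that every entry in `metainfo.json` points to existing files. This is just a convenience sketch, not part of the toolkit, and the task directory below is a placeholder:

  ```python
  # Sketch: sanity-check the entries of egs/[task]/data/metainfo.json.
  # "egs/my-task" is a placeholder; point it at your own recipe directory.
  import json
  import os

  metainfo_path = "egs/my-task/data/metainfo.json"

  with open(metainfo_path, "r") as f:
      metainfo = json.load(f)

  for subset, files in metainfo.items():
      for key in ("scp", "trans"):
          path = files[key]
          status = "ok" if os.path.isfile(path) else "MISSING"
          print(f"{subset:>12s} {key:>5s}: {path} [{status}]")
  ```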
- tokenizer training: we train the tokenizer on the training corpus (what you specified in `hyper['data']['train']`; `hyper` is the content of the hyper-parameter setting file, which is `hyper-p.json` by default) with the tokenizer settings (in `hyper['tokenizer']`). The trained tokenizer is stored as `/path/to/tokenizer.tknz` if you don't specify the path, and can be loaded directly via

  ```python
  import cat.shared.tokenizer as tknz

  tokenizer = tknz.load('path/to/tokenizer.tknz')
  ```
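  Once loaded, the tokenizer can be used to map text to token indices. The method names in the following sketch (`encode` / `decode`) are my assumption rather than a documented interface, so check `cat/shared/tokenizer.py` for the actual API:

  ```python
  # Sketch of using a trained tokenizer; the encode/decode method names are
  # assumed here and should be verified against cat/shared/tokenizer.py.
  import cat.shared.tokenizer as tknz

  tokenizer = tknz.load('path/to/tokenizer.tknz')

  ids = tokenizer.encode("hello world")   # assumed: text -> list of token ids
  print(ids)
  print(tokenizer.decode(ids))            # assumed: token ids -> text
  ```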
- data packing: in short, at this stage we pack the raw data (e.g. spectrum features or raw text) into Python-readable objects (audio spectra to numpy arrays, raw text to tokenized indices, ...). It basically just transforms the data.
- nn training: this is the most important (and generally the most time-consuming) stage. Here the neural network is fitted to the data: `hyper['data']['train']` is (are) used for training, and `hyper['data']['dev']` is (are) used as the held-out set for evaluating performance and controlling early stopping (if requested). The neural network itself is configured by the file `exp/[my-exp]/config.json`. Apart from the model, the training pipeline is configured by the settings in `hyper['train']`, where we specify properties like batch size, multi-node DDP training, automatic mixed precision training, etc.
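  To check what a given experiment will actually use, it can help to print the relevant sections of its hyper-parameter file. A minimal sketch, assuming the file sits at `exp/[my-exp]/hyper-p.json` and contains the keys discussed in this document (the experiment name below is just an example):

  ```python
  # Sketch: inspect the hyper-p.json sections referred to in this document.
  # "exp/asr-rnnt" is an example experiment directory; adjust to your own.
  import json

  with open("exp/asr-rnnt/hyper-p.json") as f:
      hyper = json.load(f)

  print("train sets   :", hyper["data"]["train"])
  print("dev sets     :", hyper["data"]["dev"])
  print("tokenizer    :", hyper.get("tokenizer"))
  print("train options:", hyper.get("train"))  # batch size, DDP, AMP, ...
  ```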
- inference: at this stage, we evaluate the model with task-specific procedures and metrics. It is highly recommended to do model averaging by setting `hyper['inference']['avgmodel']`, which can bring a significant improvement over single models without introducing any extra computational overhead. The model is evaluated, according to your configuration, on the `hyper['data']['test']` data.

  4.1. For the ASR (RNN-T/CTC) task, this is when decoding (transforming speech to text) happens. The decoding is configured by `hyper['inference']['infer']` (with `hyper['inference']['infer']['bin']='cat.ctc.decode'` or `'cat.rnnt.decode'`). We usually use the word error rate (WER) or character error rate (CER) to measure the performance of an ASR model, so there is a `hyper['inference']['er']` setting to configure how the error rate is computed.

  4.2. For the LM task (whether NNLM or n-gram LM), we use perplexity (PPL) on the test sets to measure the performance of trained LMs. To do that, you should configure the PPL calculation in `hyper['inference']['infer']` (with `hyper['inference']['infer']['bin']='cat.lm.ppl_compute'`). You may have noticed that one of the templates, lm-nn/hyper-p.json, does not include the 'infer' setting. This is because the script `pipeline/lm.py` can automatically configure it in some simple cases. If you have custom requirements, you still need to configure it yourself.
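To tie the inference-related keys above together, here is a rough, hand-written illustration of what `hyper['inference']` might look like for a CTC experiment. Only the top-level key names come from this document; the inner options are left as placeholders, so consult a shipped template for the exact schema:

```python
# Illustrative shape of hyper['inference'] for an ASR (CTC) experiment.
# Top-level keys follow this document; the inner options are intentionally
# left as placeholder comments rather than guessed field names.
inference = {
    "avgmodel": {
        # ... how checkpoints are averaged before evaluation
    },
    "infer": {
        "bin": "cat.ctc.decode",  # or "cat.rnnt.decode" / "cat.lm.ppl_compute"
        # ... decoder / ppl-computation options
    },
    "er": {
        # ... how WER/CER is computed on hyper['data']['test']
    },
}
```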
We're sorry that there is currently no detailed document explaining how to configure all the settings. However, for most basic use cases, the template experiments can be taken as references, and the configuration guideline might help, more or less.
For more advanced settings and usages, you probably need to dive into the code, but don't worry, most of it is well commented 🤓.