Turkish Automatic Speech Recognition (ASR) using Facebook's Wav2vec 2.0 models
The following Wav2vec 2.0 models were finetuned during Huggingface's Robust Speech Challenge event:
- mpoyraz/wav2vec2-xls-r-300m-cv6-turkish achieves 8.83 % WER on the Common Voice 6.1 TR test split
- mpoyraz/wav2vec2-xls-r-300m-cv7-turkish achieves 8.62 % WER on the Common Voice 7 TR test split
- mpoyraz/wav2vec2-xls-r-300m-cv8-turkish achieves 10.61 % WER on the Common Voice 8 TR test split
The following open source speech corpora are available for Turkish:
- Mozilla Common Voice TR
- Media Speech TR
This repo contains pre-processing and training scripts for these corpora.
After downloading the Turkish speech corpora above, `preprocess.py` can be used to create the dataset files for training.
- The script handles the text normalization required for proper training; a sketch of typical Turkish normalization is shown after the example command below.
- Common Voice TR corpus is handled as follows:
  - Train split: all samples in the `validated` split except the `dev` and `test` samples are reserved for training.
  - Validation split: same as the `dev` split.
  - Test split: same as the `test` split.
- Media Speech corpus is fully included in the final train split if provided.
- Final dataset CSV files with `path` & `sentence` columns are saved to the output directory: `train.csv`, `validation.csv` and `test.csv`.
```bash
python preprocess.py \
--vocab vocab.json \
--cv_path data/cv-corpus-<version>-<date>/tr \
--media_speech_path data/TR \
--output data
```
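The exact normalization rules live in `preprocess.py` and must match the characters in `vocab.json`. As a rough illustration only, a typical Turkish normalization step looks like the sketch below; the allowed character set and the function name are illustrative assumptions, not the repo's actual implementation.

```python
# Minimal sketch of Turkish-specific text normalization (illustrative only).
import re

# Assumed allowed character set: Turkish letters plus space.
ALLOWED = "abcçdefgğhıijklmnoöprsştuüvyz "

def normalize_turkish(text: str) -> str:
    # Turkish-aware lowercasing: dotted/dotless I must be mapped explicitly,
    # because str.lower() maps "I" to "i" instead of "ı".
    text = text.replace("İ", "i").replace("I", "ı").lower()
    # Drop punctuation and any character outside the allowed set.
    text = "".join(ch for ch in text if ch in ALLOWED)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_turkish("Merhaba, DÜNYA! Işık ve İstanbul."))
# -> "merhaba dünya ışık ve istanbul"
```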
facebook/wav2vec2-xls-r-300m is a large-scale multilingual pretrained model for speech and is used for fine-tuning on the Turkish speech corpora. The exact hyperparameters used are available on the model card of each finetuned model on the Huggingface model hub.
An example training command:
```bash
python train_asr.py \
--model_name_or_path facebook/wav2vec2-xls-r-300m \
--vocab_path vocab.json \
--train_file train_validation.csv \
--validation_file test.csv \
--output_dir exp \
--audio_path_column_name path \
--text_column_name sentence \
--preprocessing_num_workers 4 \
--dataloader_num_workers 4 \
--eval_metrics wer cer \
--freeze_feature_extractor \
--mask_time_prob 0.1 \
--mask_feature_prob 0.1 \
--attention_dropout 0.05 \
--activation_dropout 0.05 \
--feat_proj_dropout 0.05 \
--final_dropout 0.1 \
--learning_rate 2.5e-4 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 8 \
--num_train_epochs 20 \
--warmup_steps 500 \
--eval_steps 500 \
--save_steps 500 \
--evaluation_strategy steps \
--save_total_limit 2 \
--gradient_checkpointing \
--fp16 \
--group_by_length \
--do_train \
--do_eval
```
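Once training finishes, the checkpoint saved in `--output_dir` can be loaded like any other Wav2Vec2 CTC model. Below is a minimal inference sketch with greedy CTC decoding; the `exp` directory and `sample.wav` file name are placeholders, and it assumes `librosa` is installed and that the processor was saved alongside the model.

```python
# Minimal sketch: transcribe one audio file with the fine-tuned checkpoint.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("exp")  # --output_dir from the command above
model = Wav2Vec2ForCTC.from_pretrained("exp").eval()

# Load the audio resampled to the 16 kHz rate the model expects.
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token at each frame.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```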
The following finetuned models are available on the Huggingface model hub, and an evaluation script eval.py
with appropriate text normalization is provided. The commands for running evaluations are also available on the model cards.
- mpoyraz/wav2vec2-xls-r-300m-cv6-turkish achieves 8.83 % WER on the Common Voice 6.1 TR test split
- mpoyraz/wav2vec2-xls-r-300m-cv7-turkish achieves 8.62 % WER on the Common Voice 7 TR test split
- mpoyraz/wav2vec2-xls-r-300m-cv8-turkish achieves 10.61 % WER on the Common Voice 8 TR test split
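For reference, the sketch below shows how WER and CER can be computed with the Hugging Face `evaluate` library once references and predictions have been normalized the same way as the training transcripts; the example sentences are made up, and `eval.py` applies its own normalization before scoring.

```python
# Minimal sketch of WER/CER metric computation (illustrative only).
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Hypothetical normalized reference/prediction pairs.
references = ["merhaba dünya", "bugün hava çok güzel"]
predictions = ["merhaba dünya", "bugün hava cok güzel"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```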
For CTC beam search decoding with shallow LM fusion, an n-gram language model is trained on Turkish Wikipedia articles using KenLM, and the ngram-lm-wiki repo was used to generate the ARPA LM and convert it into binary format.
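The sketch below shows one way such a KenLM binary can be plugged into CTC beam search with `pyctcdecode`; the LM file name `tr_wiki.binary`, the `alpha`/`beta` weights and the audio file are placeholders, not the settings used for the published models.

```python
# Minimal sketch: CTC beam search with shallow n-gram LM fusion via pyctcdecode.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from pyctcdecode import build_ctcdecoder

model_id = "mpoyraz/wav2vec2-xls-r-300m-cv7-turkish"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

# Vocabulary tokens sorted by id form the decoder's label set;
# Wav2Vec2 uses "|" as the word delimiter, pyctcdecode expects a space.
sorted_vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [tok.replace("|", " ") for tok, _ in sorted_vocab]

# Placeholder LM path and fusion weights.
decoder = build_ctcdecoder(labels, kenlm_model_path="tr_wiki.binary", alpha=0.5, beta=1.0)

speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    log_probs = torch.log_softmax(model(inputs.input_values).logits[0], dim=-1).numpy()

print(decoder.decode(log_probs))
```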