Fine-tuning Automatic Speech Recognition (ASR) Models for Accented Speech

NACME Logo Apple Logo USC Logo

National Action Council for Minorities in Engineering (NACME) Artificial Intelligence - Machine Learning (AIML) Intensive Summer Bootcamp at the University of Southern California

Developed by: Dialect Dynamics

Name - Major - Undergraduate Institution - Role

  • Sebastián A. Cruz Romero - Computer Science and Engineering - University of Puerto Rico at Mayagüez - Lead
  • Aliya Daire - Computer Science and Business Administration - University of Southern California - Notes
  • Brandon Medina - Electrical Engineering - Texas A&M University-Kingsville
  • Samuel Ovalle - Mechanical Engineering - Florida International University - Notes

Description

Project Summary (PDF)

Project description preview (pages 1 and 2)

What is Automatic Speech Recognition?

Automatic Speech Recognition (ASR) converts spoken language into text, widely utilized in applications like virtual assistants and transcription services. Traditional ASR relied on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), but the advent of deep learning revolutionized the field. Neural networks, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), improved the recognition of complex speech patterns. Recent advancements include Transformer models, which leverage self-attention mechanisms for handling large contexts and dependencies, exemplified by state-of-the-art systems like OpenAI's Whisper and Facebook AI's Wav2Vec 2.0.

Example ASR model: Wav2Letter

Despite these advancements, ASR models often perform poorly on accented speech due to biases in training data, which predominantly features standard accents. Accented speech introduces acoustic variability and phonetic differences that are not well-represented in the models' training sets. These models also face limitations in language processing for accented variations, leading to inaccuracies. Additionally, many ASR systems lack sufficient fine-tuning on accented datasets, further exacerbating performance issues.

Addressing these challenges requires augmenting training datasets with diverse accented speech, employing transfer learning to fine-tune models on such data, and using domain adaptation techniques to adjust models for specific accents. Phonetic adaptation and user-specific learning can also enhance performance by accounting for individual speech patterns. By implementing these strategies, ASR technology can become more inclusive and effective across different linguistic backgrounds, improving its applicability in real-world scenarios.
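As a concrete starting point for such fine-tuning, a pre-trained checkpoint can first be used to transcribe accented audio and establish a baseline. The sketch below assumes the Hugging Face transformers and librosa packages; the checkpoint name and audio path are illustrative placeholders, not the project's actual configuration.

```python
# Baseline transcription with an off-the-shelf Whisper checkpoint (placeholder
# model ID and audio file; adjust to your own setup).
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Whisper expects 16 kHz mono audio.
audio, _ = librosa.load("sample_accented_speech.wav", sr=16000)

# Convert the waveform into log-Mel input features.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate token IDs and decode them into text.
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```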

Accented English Datasets

Accented English refers to variations in pronunciation, intonation, and stress patterns influenced by a speaker's native language or regional background. These accents can significantly differ from standard or neutral English, posing challenges for ASR systems. African-American Vernacular English (AAVE) is a prominent example, characterized by unique grammatical, phonological, and syntactical features. AAVE's distinctiveness stems from historical and cultural influences, making it a vital dialect for linguistic study and representation in ASR systems. Similarly, Indic Accented English (IAE), spoken by individuals from the Indian subcontinent, incorporates phonetic and intonational patterns influenced by native Indic languages, leading to distinct variations in English pronunciation.

The Corpus of Regional African American Language (CORAAL) is a comprehensive dataset capturing the linguistic features of AAVE across different regions and contexts. It provides valuable data for studying the intricacies of AAVE and improving ASR systems' ability to recognize and process this dialect accurately. On the other hand, the Svarah dataset focuses on Indic Accented English, offering a rich collection of speech data from Indian English speakers. Developed by AI4Bharat, Svarah aims to enhance ASR models' performance on Indic English, addressing the specific challenges posed by this accent. Both datasets are instrumental in advancing the inclusivity and accuracy of ASR technology across diverse English accents.

Our goal is to adapt ASR models to capture AAVE and/or IAE speech, including their syntactic features, for accurate transcription tasks.


Fine-tuning Workflow

Fine-tuning workflow diagram

To improve the transcription of accented speech, particularly African-American Vernacular English (AAVE) and Indic Accented English (IAE), we aim to fine-tune the Whisper model by adjusting key hyperparameters. This fine-tuning involves optimizing batch size, learning rate, scheduler, and the number of epochs to enhance the model's performance on accented speech data. By carefully modifying these parameters, we can better adapt the model to the phonetic and intonational characteristics of AAVE and IAE, thereby increasing its transcription accuracy.
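A hedged sketch of what such a hyperparameter configuration could look like with the Hugging Face Seq2SeqTrainer is shown below; the specific values for batch size, learning rate, scheduler, and epochs are illustrative assumptions, not the settings used in our runs.

```python
# Hypothetical fine-tuning configuration for Whisper with Seq2SeqTrainer.
# All values below are placeholders for illustration.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-accented-english",
    per_device_train_batch_size=8,    # batch size
    learning_rate=1e-5,               # initial learning rate
    lr_scheduler_type="linear",       # learning-rate scheduler
    warmup_steps=500,
    num_train_epochs=10,              # number of epochs
    evaluation_strategy="epoch",      # evaluate once per epoch
    fp16=True,                        # mixed precision if a GPU is available
    predict_with_generate=True,       # decode during evaluation to compute WER
)
```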

For preprocessing, we undertake two critical steps. First, we segment the audio to fit the Whisper model's maximum input length of 30 seconds, utilizing the start and end times from our transcript data. This segmentation ensures that the model processes manageable chunks of audio, preventing overflows and maintaining context. Second, we apply zero-padding to both audio and text tensors, ensuring uniformity in tensor size across batches. This padding aligns the data for efficient processing, enabling the model to handle variable-length inputs consistently. These preprocessing steps, combined with fine-tuning the model's hyperparameters, aim to significantly enhance ASR performance for AAVE and IAE.
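The helpers below sketch these two preprocessing steps under stated assumptions (16 kHz audio, transcript timestamps in seconds); the function and variable names are hypothetical and do not come from the repository.

```python
# Illustrative preprocessing helpers:
# (1) slice an utterance out of a long recording using transcript timestamps
#     and cap it at Whisper's 30-second input window,
# (2) zero-pad a batch of variable-length 1-D tensors to a uniform length.
import torch

MAX_SECONDS = 30          # Whisper's maximum input length
SAMPLE_RATE = 16000

def segment_utterance(waveform, start_s, end_s):
    """Cut one utterance from a recording and truncate it to 30 seconds."""
    start = int(start_s * SAMPLE_RATE)
    end = int(end_s * SAMPLE_RATE)
    segment = waveform[start:end]
    return segment[: MAX_SECONDS * SAMPLE_RATE]

def pad_batch(tensors, pad_value=0):
    """Right-pad a list of 1-D tensors (audio or token IDs) to the batch max."""
    max_len = max(t.shape[0] for t in tensors)
    padded = torch.full((len(tensors), max_len), pad_value, dtype=tensors[0].dtype)
    for i, t in enumerate(tensors):
        padded[i, : t.shape[0]] = t
    return padded
```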


Preliminary Results

| Model Type | Sample Quantity | Loss | Word Error Rate |
| --- | --- | --- | --- |
| Pre-trained, IAE-tested | 680,000 hrs | 4.3300 | 0.2786 |
| Pre-trained, AAVE-tested | 680,000 hrs | *0.0192 | *0.4234 |
| IAE-trained | 9.6 hrs | 4.4142 | 0.2746 |
| AAVE-trained | 8.620 hrs | *0.2746 | *0.3504 |

*Values marked with an asterisk were obtained by running inference on a small subset due to resource limitations and timeline constraints.
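Word Error Rate (WER) here is the standard metric: substitutions, deletions, and insertions divided by the number of reference words. A minimal example of computing it with the jiwer package, assumed here purely for illustration:

```python
# WER = (substitutions + deletions + insertions) / number of reference words.
# jiwer is one common implementation, used here only as an example dependency.
import jiwer

reference = "she had your dark suit in greasy wash water all year"
hypothesis = "she had you dark suit in greasy wash water all year"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.4f}")  # one substitution over 11 reference words ≈ 0.0909
```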

The preliminary results of our study reveal promising trends in the fine-tuning process of the Whisper model for accented speech recognition. The Training Loss per Epoch graph demonstrates a steady decline in training loss as the number of epochs increases, indicating effective learning and improved performance on the training data. This trend is mirrored in the Training Word Error Rate (WER) per Epoch graph, which shows a consistent decrease in WER, reflecting enhanced accuracy in the model's predictions during training.
(Training loss and training WER per epoch plots)

The validation metrics further support these positive outcomes. The Validation Loss per Epoch graph exhibits a general downward trend, although with some fluctuations, suggesting that the model is performing well on unseen validation data. These variations indicate that additional tuning may be required to stabilize the learning process. Similarly, the Validation Word Error Rate (WER) per Epoch graph shows a declining trend with some variability, highlighting areas where model adjustments or more data could improve consistency and overall performance.
(Validation loss and validation WER per epoch plots)

Conclusion and Discussion

  1. The models show no significant difference in performance or accuracy between the pre-trained model and the IAE-fine-tuned model; however, the two were trained on datasets of very different sizes.
  2. Results suggest that including diverse English accents in training datasets can enhance ASR technology, making it more inclusive and accurate.
  3. Use of the IAE dataset in our ASR model suggests that ASR systems can be tailored to better serve underrepresented speech patterns, thereby addressing equity issues in technology.
  4. Detailed error analysis revealed common transcription errors related to specific phonetic nuances of IAE, guiding future model enhancements.

Future Work

  1. Perform fine-tuning on AAVE Speech and compare performance against SOTA AAVE-trained models.
  2. Extend the research to other English accents (Latin American, European, etc.).
  3. Explore optimization techniques for model performance in resource-limited environments.
  4. Develop mechanisms to apply user feedback for continuous fine-tuning on evolving speech patterns.

Usage Instructions

  1. Pending

Poster Presentation

The poster presentation was showcased on August 2nd, 2024, at the University of Southern California Viterbi School of Engineering Research Symposium.


Oral Presentation

The oral presentation was delivered on August 1st, 2024, at the NACME Apple Artificial Intelligence-Machine Learning Intensive Bootcamp 2024 Final Project Showcase.


References

  1. J. Baugh, Beyond Ebonics: Linguistic Pride and Racial Prejudice. Oxford University Press, 2000.
  2. W. Labov, Language in the Inner City: Studies in the Black English Vernacular. University of Pennsylvania Press, 1972.
  3. J. R. Rickford, African American Vernacular English: Features, Evolution, Educational Implications. Wiley-Blackwell, 1999.
  4. D. S. Ureña and K. Schindler, "CORAAL: The Corpus of Regional African American Language," University of Oregon, 2022. [Online]. Available: https://oraal.uoregon.edu/coraal. [Accessed: 01-Aug-2024].
  5. AI4Bharat, "Svarah: A Large Scale Indic Speech Recognition Dataset," 2022. [Online]. Available: https://github.com/AI4Bharat/Svarah. [Accessed: 01-Aug-2024].
  6. G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," arXiv preprint arXiv:1503.02531, 2015.
  7. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
  8. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving Language Understanding by Generative Pre-Training," OpenAI, 2018.
  9. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," in Advances in Neural Information Processing Systems, 2020.
  10. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv preprint arXiv:1907.11692, 2019.
  11. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805, 2018.
  12. S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. S. Moritz, P. W. Chan, and S. Watanabe, "A Comparative Study on Transformer vs RNN in Speech Applications," arXiv preprint arXiv:1909.06317, 2019.
  13. A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented Transformer for Speech Recognition," arXiv preprint arXiv:2005.08100, 2020.
  14. S. Wang, Y. Wu, X. Xu, Y. Xie, Y. Zhang, and H. Meng, "Fine-tuning Bidirectional Encoder Representations from Transformers (BERT) for Context-Dependent End-to-End ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2938-2950, 2021.
  15. Kaldi ASR Toolkit, "Kaldi: A toolkit for speech recognition," 2011. [Online]. Available: http://kaldi-asr.org. [Accessed: 01-Aug-2024].
  16. A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645-6649.
  17. A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
  18. W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960-4964.
  19. S. Niu, B. Tang, Z. Cheng, X. Chen, J. Zeng, Y. Huangfu, and H. Meng, "Improving Speech Recognition Systems for Accented English Speech with Multi-Accent Data Augmentation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1679-1692, 2020.
  20. R. Maas and D. Hovy, "Adapting Deep Learning Models for Accented Speech Recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 7969-7976, 2020.

About

Final project titled "Fine-tuning Automatic Speech Recognition Models for Accented English" as a program requirement for completion at the Apple-NACME Artificial Intelligence-Machine Learning Intensive Bootcamp 2024 hosted at the University of Southern California.
