| copyright | lastupdated | subcollection  |
|-----------|-------------|----------------|
|           | 2020-07-29  | speech-to-text |
{{site.data.keyword.attribute-definition-list}}
# The science behind the service
{: #science}
As Pioneering Speech Recognition{: external} describes, {{site.data.keyword.IBM}} has been at the forefront of speech recognition research since the early 1960s. For example, Bahl, Jelinek, and Mercer (1983) describes the basic mathematical approach to speech recognition that is employed in essentially all modern speech recognition systems. And Jelinek (1985) describes the creation of the first real-time large-vocabulary speech recognition system for dictation. That paper also describes problems that remain unsolved research topics today.
{: shortdesc}
{{site.data.keyword.IBM_notm}} continues this rich tradition of research and development with the {{site.data.keyword.speechtotextfull}} service. {{site.data.keyword.IBM_notm}} has demonstrated industry-record speech recognition accuracy on the public benchmark data sets for Conversational Telephone Speech (CTS) (Saon and others, 2017) and Broadcast News (BN) transcription (Thomas and others, 2019). {{site.data.keyword.IBM_notm}} leveraged neural networks for language modeling (Kurata and others, 2017a, and Kurata and others, 2017b), in addition to demonstrating the effectiveness of acoustic modeling.
The following announcements summarize {{site.data.keyword.IBM_notm}}'s recent speech recognition accomplishments:
- Reaching new records in speech recognition{: external}
- {{site.data.keyword.IBM_notm}} Breaks Industry Record for Conversational Speech Recognition by Extending Deep Learning Technologies{: external}
- {{site.data.keyword.IBM_notm}} Sets New Transcription Performance Milestone on Automatic Broadcast News Captioning{: external}
- IBM Research AI Advances Speaker Diarization in Real Use Cases{: external}
- Advancing RNN Transducer Technology for Speech Recognition{: external}
These accomplishments help to further advance {{site.data.keyword.IBM_notm}}'s speech services. Recent ideas that best fit the cloud-based {{site.data.keyword.speechtotextshort}} service include the following:
- For language modeling, {{site.data.keyword.IBM_notm}} leverages a neural network-based language model to generate training text (Suzuki and others, 2019).
- For acoustic modeling, {{site.data.keyword.IBM_notm}} uses a fairly compact model to accommodate the resource limitations of the cloud. To train this compact model, {{site.data.keyword.IBM_notm}} uses "teacher-student training / knowledge distillation." Large and strong neural networks such as Long Short-Term Memory (LSTM), VGG, and the Residual Network (ResNet) are first trained. The output of these networks is then used as teacher signals to train a compact model for actual deployment (Fukuda and others, 2017).
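The teacher-student idea above can be sketched as a training objective: the student is trained to match the teacher network's softened output distribution rather than only the hard labels. The following is a minimal illustrative sketch of that distillation loss, not the service's actual training code; the function names and the temperature value are assumptions for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures smooth the
    distribution, exposing more of the teacher's 'dark knowledge'."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's predictions against the teacher's
    softened outputs (the 'teacher signal'), averaged over the batch.
    In practice this term is usually combined with the ordinary
    hard-label loss."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return -np.mean(np.sum(teacher_probs * student_log_probs, axis=-1))
```

The loss is smallest when the compact student reproduces the large teacher's output distribution, which is how the teacher's knowledge is transferred into a model small enough for cloud deployment.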
To further push the envelope, {{site.data.keyword.IBM_notm}} also focuses on end-to-end modeling. For example, it has established a strong modeling pipeline for direct acoustic-to-word models (Audhkhasi and others, 2017, and Audhkhasi and others, 2018) that it is now further improving (Saon and others, 2019). It is also making efforts to create compact end-to-end models for future deployment on the cloud (Kurata and Audhkhasi, 2019).
For more information about the scientific research behind the service, see the documents that are listed in Research references.