Improving Speech Recognition for People with Disordered Speech
The AI for Social Good team at Google is dedicated to creating positive social impact through the use of AI, working on projects across public health, accessibility, crisis response, climate and energy, and nature and society. Within this effort, Project Euphonia focuses on improving automatic speech recognition (ASR) for people with disordered speech.
ASR models typically achieve a word error rate (WER) below 10% for people with typical speech. For individuals with speech disorders such as stuttering, dysarthria, and apraxia, however, WER can climb to 50% or even 90%. To address this gap, Project Euphonia collected over 1,000 hours of disordered speech samples from more than 1,000 participants, and found that personalizing an ASR model can bridge the performance gap for users with disordered speech with as little as 3-4 minutes of training speech.
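For context, WER is the minimum number of word-level substitutions, insertions, and deletions needed to turn a model's hypothesis into the reference transcript, divided by the number of reference words. A minimal sketch of the metric (not Project Euphonia's evaluation code) might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion over five reference words -> WER 0.4
print(word_error_rate("please turn on the lights", "please turn the light"))
```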
That finding led to the development of Project Relate, which allows individuals with atypical speech to train their own personalized speech model. By using these personalized models, people can communicate more effectively and gain independence.
To make ASR more accessible, Project Euphonia fine-tuned Google’s Universal Speech Model (USM) to better understand disordered speech out of the box. This lets users operate digital assistants and dictation apps, and hold conversations, without any additional personalization.
Personalized models have clear benefits, but collecting a large number of speech examples is challenging for many users, and personalized models may not perform well in freeform conversation. To overcome these challenges, Project Euphonia focused on speaker-independent ASR (SI-ASR), which requires no per-user training.
To build and evaluate a robust SI-ASR model, Project Euphonia created the Prompted Speech dataset and the Real Conversation test set. The dataset includes over 950k speech utterances from over 1,000 speakers with disordered speech, while the test set contains over 1,500 utterances from 29 speakers, recorded during conversations.
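A key property of a speaker-independent evaluation is that training and test speakers stay disjoint, so the model is scored on voices it has never heard. The record fields and helper below are hypothetical, not the dataset's actual schema, and are included only to illustrate that organization:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker_id: str   # anonymized speaker identifier (hypothetical field)
    audio_path: str   # path to the recorded waveform
    transcript: str   # reference transcript used for WER scoring
    prompted: bool    # True for prompted speech, False for conversational speech

def split_by_speaker(utterances, test_speakers):
    """Keep train/test speaker sets disjoint so SI-ASR is evaluated on unseen voices."""
    train = [u for u in utterances if u.speaker_id not in test_speakers]
    test = [u for u in utterances if u.speaker_id in test_speakers]
    return train, test
```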
To adapt the USM to disordered speech, Project Euphonia used residual adapters: tunable bottleneck layers added as residual connections between transformer layers. Because the adapters are small relative to the full model, they can be tuned efficiently while the rest of the network is left unchanged, and this approach proved successful in improving ASR performance on disordered speech.
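Conceptually, a residual adapter projects each layer's hidden states down to a small bottleneck, applies a nonlinearity, projects back up, and adds the result to the original activations, so only the small adapter weights need to be tuned. The sketch below assumes illustrative sizes and a ReLU activation, with the up-projection zero-initialized so the adapter starts out as an identity function:

```python
import numpy as np

class ResidualAdapter:
    """Bottleneck layer inserted between transformer layers: h + up(relu(down(h)))."""
    def __init__(self, d_model: int, d_bottleneck: int, rng: np.random.Generator):
        # Down-projection compresses, up-projection restores the hidden size.
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        self.w_up = np.zeros((d_bottleneck, d_model))  # zero init: adapter starts as identity

    def __call__(self, hidden: np.ndarray) -> np.ndarray:
        # hidden: (sequence_length, d_model) activations from the preceding layer
        bottleneck = np.maximum(hidden @ self.w_down, 0.0)  # ReLU nonlinearity
        return hidden + bottleneck @ self.w_up              # residual connection

# Inserting a 16-dim adapter into a 512-dim transformer stack (sizes are assumptions)
adapter = ResidualAdapter(d_model=512, d_bottleneck=16, rng=np.random.default_rng(0))
h = np.random.default_rng(1).normal(size=(20, 512))
out = adapter(h)  # same shape as the input; initially equals h because w_up is zero
```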
The adapted USM significantly outperformed previous models in terms of WER: on the Real Conversation test set it achieved a 37% lower WER than the pre-USM model, and on the Prompted Speech test set a 53% lower WER.
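Here "lower" refers to relative WER reduction, (WER_baseline − WER_adapted) / WER_baseline. The values in the example below are hypothetical and only show the arithmetic:

```python
def relative_wer_reduction(wer_baseline: float, wer_adapted: float) -> float:
    """Fraction by which the adapted model's WER is lower than the baseline's."""
    return (wer_baseline - wer_adapted) / wer_baseline

# Hypothetical values: a drop from 30% to 18.9% WER is a 37% relative reduction
print(relative_wer_reduction(0.30, 0.189))  # ≈ 0.37
```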
The improvement is evident when comparing transcripts of real conversation recordings: the adapted USM recognized disordered speech patterns and accurately transcribed important words that the baseline model missed.
In conclusion, Google’s Project Euphonia is making strides in improving speech recognition for people with disordered speech, and remains committed to enhancing the accessibility and usability of ASR models through both personalization and adaptation. As ASR technology advances further, the team aims to ensure that individuals with disordered speech benefit from those advances as well.
Key contributors to this project include Fadi Biadsy, Michael Brenner, Julie Cattiau, Richard Cave, Amy Chung-Yu Chou, Dotan Emanuel, Jordan Green, Rus Heywood, Pan-Pan Jiang, Anton Kast, Marilyn Ladewig, Bob MacDonald, Philip Nelson, Katie Seaver, Joel Shor, Jimmy Tobin, Katrin Tomanek, and Subhashini Venugopalan. The support of the USM research team and the participation of over 2,200 individuals in recording speech samples are also greatly appreciated.