Google Research Scientist Thibault Sellam introduces SQuId (Speech Quality Identification), a regression model that rates how natural speech sounds in many languages. The goal is to provide a low-cost alternative to human evaluation for TTS (text-to-speech) models.
SQuId is a 600M-parameter model that takes an utterance as input and returns a score between 1 and 5 indicating how natural the waveform sounds. It consists of three components: an encoder, a pooling/regression layer, and a fully connected layer. The encoder embeds a spectrogram into a smaller 2D matrix, the pooling/regression layer aggregates that matrix, and the fully connected layer turns the aggregate into the final score. The model is trained on the SQuId corpus, a collection of 1.9 million rated utterances across 66 languages.
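The three-stage pipeline described above can be illustrated with a toy NumPy sketch. This is not SQuId's implementation: the layer sizes, the random weights, the frame-averaging stand-in for the encoder, the mean pooling, and the sigmoid that maps the output into the 1 to 5 range are all assumptions chosen only to show how data flows from spectrogram to score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; these are NOT SQuId's real dimensions.
N_MEL, D_EMBED, STRIDE = 80, 16, 4

# Random weights stand in for trained parameters (assumption).
W_enc = rng.normal(scale=0.1, size=(N_MEL, D_EMBED))
w_fc = rng.normal(scale=0.1, size=D_EMBED)
b_fc = 0.0


def encode(spectrogram):
    """Stand-in encoder: shrink the time axis, then embed each frame.

    Input shape (frames, N_MEL); output shape (frames // STRIDE, D_EMBED),
    i.e. a smaller 2D matrix.
    """
    n_frames = (spectrogram.shape[0] // STRIDE) * STRIDE
    # Average every STRIDE consecutive frames to downsample in time.
    downsampled = spectrogram[:n_frames].reshape(-1, STRIDE, N_MEL).mean(axis=1)
    return np.tanh(downsampled @ W_enc)


def pool(embeddings):
    """Aggregate the 2D matrix into one vector (mean pooling is an assumption)."""
    return embeddings.mean(axis=0)


def score(spectrogram):
    """Fully connected layer on the pooled vector, squashed into [1, 5].

    The sigmoid rescaling is an illustrative choice, not the documented method.
    """
    z = pool(encode(spectrogram)) @ w_fc + b_fc
    return 1.0 + 4.0 / (1.0 + np.exp(-z))


# A fake 200-frame spectrogram in place of a real utterance.
utterance = rng.normal(size=(200, N_MEL))
print(encode(utterance).shape)  # (50, 16)
print(score(utterance))
```

With untrained weights the score is meaningless; the point is only the shape of the computation: a 2D spectrogram goes in, a smaller 2D matrix comes out of the encoder, pooling reduces it to a vector, and the final layer produces a single rating.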
The SQuId corpus covers various TTS systems and use cases, exposing the model to a wide range of errors, such as acoustic artifacts, prosody issues, text normalization errors, and pronunciation mistakes. Although the amount of training data varies widely between languages, the researchers chose to train a single model for all languages and rely on cross-locale transfer to improve accuracy.
Experimental results show that SQuId outperforms state-of-the-art baselines, with up to 50.0% higher accuracy. Cross-locale transfer helps most languages; the exception is English, which is already well represented in the dataset. SQuId can even score languages it has never seen during training, further evidence that cross-locale transfer works.
Future work includes improving the model's accuracy, expanding language coverage, and addressing new error types. SQuId is a significant step toward evaluating speech quality in many languages and toward complementing human evaluation in TTS research.