Google Research has introduced Spectron, a new approach to spoken language modeling. Spectron is presented as the first spoken language model that directly processes spectrograms as both input and output, rather than learning discrete speech representations. Starting from only a pre-trained text language model, Spectron can be fine-tuned to generate high-quality spoken language. It also retains the knowledge of the original large language model (LLM) and demonstrates state-of-the-art performance on spoken question answering datasets.
The Spectron model combines the encoder of a speech recognition model with a pre-trained Transformer-based decoder language model. During training, each speech utterance is split into a prompt and its continuation. The model learns to reconstruct the full transcript, covering both the prompt and the continuation, alongside the continuation’s speech features. During inference, only a prompt is provided, and the model generates the prompt’s transcription, a text continuation, and a speech continuation.
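The training setup described above can be sketched as a small data-preparation step. This is only an illustration under stated assumptions: the names `TrainingExample` and `split_utterance` are hypothetical, the spectrogram is represented as a plain list of frames, and the split point `prompt_len` is taken as given, since how it is chosen is not described here.

```python
from dataclasses import dataclass
from typing import List

Frame = List[float]  # one spectrogram frame (hypothetical representation)


@dataclass
class TrainingExample:
    prompt_frames: List[Frame]        # speech fed to the encoder
    text_target: str                  # full transcript: prompt + continuation
    continuation_frames: List[Frame]  # speech features the model must reconstruct


def split_utterance(
    frames: List[Frame],
    prompt_text: str,
    continuation_text: str,
    prompt_len: int,
) -> TrainingExample:
    """Split one utterance into a prompt and its continuation.

    Assumption: prompt_len is the frame index where the prompt ends.
    """
    if not 0 < prompt_len < len(frames):
        raise ValueError("prompt_len must leave both a prompt and a continuation")
    return TrainingExample(
        prompt_frames=frames[:prompt_len],
        text_target=f"{prompt_text} {continuation_text}",
        continuation_frames=frames[prompt_len:],
    )


# Toy utterance: 10 frames with 4 bins each.
toy_frames = [[float(i)] * 4 for i in range(10)]
example = split_utterance(toy_frames, "hello", "world", prompt_len=3)
```

At inference time only the equivalent of `prompt_frames` would be available; the text and the continuation frames are what the model is asked to produce.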
Spectron's architecture has two main components: a speech encoder and a language model decoder. The speech encoder is a pre-trained 600M-parameter conformer encoder that captures both linguistic and acoustic information from the input spectrogram. The language model decoder is a 350M or 1B parameter decoder trained in the manner of PaLM 2. During training, the decoder is teacher-forced to predict the text transcription, the text continuation, and the speech embeddings. Spectron also uses pre- and post-networks, lightweight modules that convert spectrograms to speech embeddings and back.
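To make the role of the pre- and post-networks concrete, here is a minimal shape-level sketch. It is not the actual Spectron implementation: each network is stood in for by a single random linear map, and the dimensions `N_MEL` and `D_MODEL` as well as the helper names are hypothetical. The point is only that one module maps spectrogram frames into the decoder's embedding width and the other maps embeddings back to spectrogram frames.

```python
import random
from typing import List

Matrix = List[List[float]]


def linear(frames: Matrix, weights: Matrix) -> Matrix:
    """Apply an (in_dim x out_dim) weight matrix to every frame."""
    columns = list(zip(*weights))  # one tuple of in_dim weights per output unit
    return [
        [sum(x * w for x, w in zip(frame, col)) for col in columns]
        for frame in frames
    ]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


N_MEL = 8     # hypothetical number of spectrogram bins per frame
D_MODEL = 16  # hypothetical decoder embedding width

rng = random.Random(0)
pre_net = random_matrix(N_MEL, D_MODEL, rng)   # spectrogram frame -> embedding
post_net = random_matrix(D_MODEL, N_MEL, rng)  # embedding -> spectrogram frame

spectrogram = [[rng.random() for _ in range(N_MEL)] for _ in range(5)]
embeddings = linear(spectrogram, pre_net)       # 5 frames, D_MODEL values each
reconstructed = linear(embeddings, post_net)    # 5 frames, N_MEL values each
```

In the real model the decoder sits between these two steps, and the weights are learned rather than random.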
To evaluate Spectron, the researchers ran experiments on the Libri-Light dataset. The results show that Spectron outperforms existing spoken language models on log-perplexity, mean opinion score (MOS), and speaker similarity. In other words, its generated speech is more coherent and semantically sound, sounds more natural to human evaluators, and preserves the voice of the speaker in the input speech.
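Two of these metrics have simple generic definitions, sketched below. This is an illustration of the usual formulas rather than the evaluation code behind the reported results: log-perplexity is taken as the average negative log-probability a language model assigns to the tokens, and speaker similarity as the cosine similarity between two speaker embedding vectors. Where the token probabilities and speaker embeddings come from is left out and would depend on the models used.

```python
import math
from typing import Sequence


def log_perplexity(token_log_probs: Sequence[float]) -> float:
    """Average negative log-probability per token (lower is better)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (higher is more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy inputs: four tokens each assigned probability 0.5,
# and two made-up speaker embeddings.
toy_log_ppl = log_perplexity([math.log(0.5)] * 4)
toy_similarity = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05])
```

MOS is not computed from a formula like these; it is an average of ratings given by human listeners.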
In conclusion, Spectron advances spoken language modeling by processing spectrograms directly and generating high-quality spoken language. By pairing a pre-trained speech encoder with a pre-trained language model decoder and training them to predict text and speech jointly, Spectron achieves state-of-the-art performance in speech continuation and spoken question answering tasks.