Posted by Zalán Borsos, Research Software Engineer, and Marco Tagliasacchi, Senior Staff Research Scientist, Google Research
Generative AI has made significant progress in creating new content across domains such as text, vision, and audio. For audio, a key ingredient is the neural audio codec, which converts raw waveforms into a compressed representation: a sequence of discrete audio tokens that captures both the local properties and the temporal structure of sound. Modeling these token sequences has enabled advances in speech continuation, text-to-speech, and general audio and music generation.
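To make the idea of discrete audio tokens concrete, here is a minimal NumPy sketch of how a single vector-quantization codebook maps encoder frame embeddings to token ids by nearest-neighbor lookup. The function name, the toy codebook, and the frame embeddings are illustrative assumptions, not the actual codec used in this work.

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Map each frame embedding to the index of its nearest codebook vector.

    frames:   (num_frames, dim) array of encoder outputs
    codebook: (codebook_size, dim) array of learned code vectors
    Returns one discrete token id per frame.
    """
    # Squared Euclidean distance between every frame and every code vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy example: 4 frames, an 8-entry codebook, 2-dim embeddings.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 2))
frames = rng.normal(size=(4, 2))
tokens = quantize_frames(frames, codebook)
print(tokens)  # one token id in [0, 8) per frame
```

A real neural codec stacks several such quantizers residually, with each level encoding what the previous levels missed.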
However, the autoregressive decoding used by generative audio models such as AudioLM is slow at inference time for long sequences, since tokens are produced one at a time. To address this, we propose SoundStorm, a method for efficient, high-quality audio generation. SoundStorm pairs an architecture tailored to the token structure produced by neural audio codecs with a decoding scheme inspired by MaskGIT, a masked parallel decoding method originally developed for image generation.
Because it generates audio tokens in parallel, SoundStorm reduces inference time by a factor of 100 for long sequences while preserving the quality and consistency of the generated audio. Moreover, when combined with the text-to-semantic modeling stage of SPEAR-TTS, SoundStorm can synthesize high-quality, natural dialogues, with control over the spoken content, the speakers' voices, and speaker turns.
SoundStorm is built on a bidirectional attention-based Conformer, an architecture that combines a Transformer with convolutions to capture both the local and the global structure of a token sequence. It predicts the audio tokens produced by the codec's residual vector quantization (RVQ), level by level, conditioned on the coarser levels. At inference time, SoundStorm starts from a fully masked token sequence and fills it in over multiple iterations: within each RVQ level, many tokens are predicted in parallel, and the model proceeds coarse-to-fine, gradually adding detail. This parallel, level-wise prediction is what keeps inference cost low.
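The iterative fill-in procedure can be sketched for a single RVQ level as MaskGIT-style confidence-based decoding: at each step, sample candidates for every position in parallel, keep the most confident ones, and re-mask the rest according to a schedule. The function names, the cosine masking schedule, and the uniform dummy predictor standing in for the Conformer are assumptions for illustration, not SoundStorm's actual implementation.

```python
import numpy as np

MASK = -1  # sentinel id for still-masked positions

def cosine_schedule(step, total_steps):
    # Fraction of positions left masked after this step (decays 1 -> 0).
    return np.cos(0.5 * np.pi * (step + 1) / total_steps)

def iterative_decode(predict_fn, seq_len, vocab_size, steps=8, seed=0):
    """Fill a fully masked sequence over `steps` parallel iterations.

    predict_fn(tokens) -> (seq_len, vocab_size) per-position probabilities;
    it stands in for the trained network (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        probs = predict_fn(tokens)
        # Sample a candidate token for every position in parallel.
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != MASK] = np.inf  # already-decided tokens stay decided
        # The schedule dictates how many positions must remain masked.
        num_masked = int(np.floor(cosine_schedule(step, steps) * seq_len))
        keep = np.argsort(-conf)[: seq_len - num_masked]
        new_tokens = np.full(seq_len, MASK)
        new_tokens[keep] = np.where(tokens[keep] != MASK,
                                    tokens[keep], sampled[keep])
        tokens = new_tokens
    return tokens

# Demo with a uniform dummy predictor in place of a trained model.
uniform = lambda tokens: np.full((16, 32), 1.0 / 32)
out = iterative_decode(uniform, seq_len=16, vocab_size=32)
```

By the final step the schedule reaches zero, so every position has been committed; the full method repeats this level by level across the RVQ hierarchy.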
Our experiments show that SoundStorm matches the quality of AudioLM’s acoustic generator while producing audio 100x faster, with improved consistency of speaker identity and acoustic conditions. We acknowledge that biases present in the training data may propagate to the model, and we intend to address them responsibly. SoundStorm-generated audio remains detectable by a dedicated classifier, which mitigates the risk of misuse.
In conclusion, SoundStorm is a fast, high-quality audio generation model that outperforms autoregressive decoding in speed while improving consistency. Combined with models such as SPEAR-TTS, it enables natural dialogue synthesis with control over spoken content, voices, and speaker turns. We believe SoundStorm will open up new directions for audio generation research at a reduced computational cost.