The Breakthrough in Generative Modeling: Perceiver AR
Autoregressive Transformers are making strides in generative modeling, predicting one element after another to generate images, text, audio, and more. However, training deep Transformers on long sequences is computationally costly, which limits the input sizes they can handle.
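The autoregressive loop itself is simple: each new element is sampled conditioned on everything generated so far. A minimal sketch (the `toy_model` below is a hypothetical stand-in for a trained network, not part of Perceiver AR):

```python
import numpy as np

def generate(model, prompt, steps, rng):
    # Autoregressive generation: sample one element at a time,
    # feeding the growing sequence back into the model.
    seq = list(prompt)
    for _ in range(steps):
        probs = model(seq)                  # distribution over the vocabulary
        nxt = rng.choice(len(probs), p=probs)
        seq.append(int(nxt))
    return seq

# Toy "model": a fixed uniform distribution over a 4-symbol vocabulary.
toy_model = lambda seq: np.full(4, 0.25)
rng = np.random.default_rng(0)
out = generate(toy_model, [0], steps=5, rng=rng)
```

The cost pressure comes from `model(seq)`: in a standard Transformer, attention over the full sequence grows quadratically with its length.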
Enter the Perceiver models. They address the input-size problem by efficiently handling inputs of up to 100,000 elements, using cross-attention to encode the inputs into a smaller latent space. This gives a fixed compute cost at every layer, regardless of input size.
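The core idea can be sketched in a few lines: queries come from a small, fixed-size latent array, while keys and values come from the (arbitrarily long) input, so the attention cost is linear in input length rather than quadratic. This is a minimal single-head NumPy sketch, not the DeepMind implementation; the array sizes are illustrative:

```python
import numpy as np

def cross_attend(latents, inputs):
    # latents: [num_latents, d] (queries), inputs: [seq_len, d] (keys/values).
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # [num_latents, seq_len]
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # softmax over the input axis
    return w @ inputs                          # [num_latents, d]

rng = np.random.default_rng(0)
num_latents, seq_len, d = 256, 8192, 64
latents = rng.standard_normal((num_latents, d))
inputs = rng.standard_normal((seq_len, d))
out = cross_attend(latents, inputs)
```

Because later layers operate only on the 256 latents, not the 8,192 inputs, per-layer cost stays fixed no matter how long the input grows.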
Perceiver AR extends this architecture to autoregressive generation and proves a game-changer for longer inputs, outperforming standard Transformers and Transformer-XL. A 60-layer Perceiver AR with a context length of 8,192 even surpasses a 42-layer Transformer-XL on a book-length generation task.
The advantages of Perceiver AR go beyond handling longer inputs. The model can adapt its compute budget at evaluation time, spending more or less for better or faster generation; it outperforms Transformer-XL even at the same compute budget, thanks to its greater context; and it shows less sensitivity to the left-to-right ordering used to generate elements.
For music lovers, Perceiver AR can generate new pieces of music with coherence in melodic, harmonic, and rhythmic structure. The possibilities with Perceiver AR are exciting and endless.
To learn more about using Perceiver AR, check out the JAX code for training on GitHub, read the paper on arXiv, and watch the spotlight presentation at ICML 2022. And don’t miss the Google Magenta blog post with more music generated by Perceiver AR!