Autoregressive Transformers have driven significant advances in generative modeling. These models predict each element of a sample, such as a pixel in an image or a character in a text, one at a time. However, the cost of each self-attention layer in a Transformer grows quadratically with the number of input elements, which in practice limits the sequences these models can handle to around 2,048 elements.
Our Perceiver models, on the other hand, have shown excellent performance on real-world tasks with sequences of up to 100,000 elements. Perceivers use cross-attention to encode inputs into a fixed-size latent space, so the cost of each subsequent layer is independent of input length; this keeps compute requirements low and makes deeper models practical.
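To make the fixed-cost property concrete, here is a minimal sketch of a single-head cross-attention read in JAX. The names are ours, not the released code, and learned query/key/value projections, multiple heads, and normalization are all omitted:

```python
import jax
import jax.numpy as jnp

def cross_attend(latents, inputs):
    """A fixed-size latent array queries the full (possibly very long) input.

    latents: [num_latents, d]  -- size we choose, independent of input length
    inputs:  [seq_len, d]      -- e.g. up to 100,000 elements
    returns: [num_latents, d]
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / jnp.sqrt(d)   # [num_latents, seq_len]
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ inputs                     # back to the fixed latent size

# Every subsequent self-attention layer operates on the latents alone, so its
# cost depends only on num_latents, never on seq_len.
```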
Perceivers process all input elements in a single pass, but autoregressive generation requires predicting elements one by one, each conditioned only on what came before. Perceiver AR addresses this with a simple change: it aligns the latents with the final elements of the input and causally masks the cross-attention, so each latent can only see earlier positions in the input.
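Extending the sketch above, the causal masking might look like this (again a simplification of ours, not the released implementation):

```python
import jax
import jax.numpy as jnp

def causal_cross_attend(latents, inputs):
    """Cross-attention where latent i is aligned with input position
    seq_len - num_latents + i and may only see that position and earlier ones.
    Assumes num_latents <= seq_len."""
    num_latents, d = latents.shape
    seq_len = inputs.shape[0]
    latent_pos = jnp.arange(seq_len - num_latents, seq_len)  # [num_latents]
    input_pos = jnp.arange(seq_len)                          # [seq_len]
    mask = input_pos[None, :] <= latent_pos[:, None]         # [num_latents, seq_len]

    scores = latents @ inputs.T / jnp.sqrt(d)
    scores = jnp.where(mask, scores, -jnp.inf)   # hide later positions
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ inputs
```

During training, each latent then predicts the element that follows its aligned position, so a single forward pass yields a prediction target for each of the final num_latents elements.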
This yields an architecture that can handle inputs up to 50 times longer than standard Transformers can, while remaining just as easy to deploy.
Perceiver AR outperforms both standard Transformers and Transformer-XL at comparable sequence lengths, and it scales considerably better with model size, enabling very effective long-context models. For instance, on a book-length generation task, a 60-layer Perceiver AR with a context length of 8,192 outperforms a 42-layer Transformer-XL while running faster in wall-clock time.
Perceiver AR achieves state-of-the-art results on image, language, and music generation benchmarks. Because it decouples input size from compute budget, Perceiver AR also offers the flexibility to adapt that budget at evaluation time, trading it off for either improved quality or faster generation. And it is less sensitive to the order in which elements are generated, making it well suited to data, like images, that has multi-dimensional structure.
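As a rough illustration of that decoupling (our own back-of-the-envelope arithmetic with made-up sizes, not figures from the paper), the per-layer attention cost depends on the number of latents rather than the input length, so the latent count acts as an evaluation-time compute knob:

```python
# Approximate attention-score operations per layer (constants ignored).
seq_len, d = 65_536, 1_024  # hypothetical input length and model width

for num_latents in (512, 1024, 2048):
    cross_read = num_latents * seq_len * d   # one-time read of the full input
    latent_layer = num_latents ** 2 * d      # each subsequent latent layer
    dense_layer = seq_len ** 2 * d           # standard Transformer layer, for scale
    print(f"num_latents={num_latents}: cross read {cross_read:.1e} ops, "
          f"latent layer {dense_layer // latent_layer}x cheaper than dense")
```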
We trained Perceiver AR using a dataset of piano music and were able to generate new pieces. Since each note is predicted based on the sequence before it, Perceiver AR produces music with a high level of coherence in melody, harmony, and rhythm.
To learn more about Perceiver AR, you can download the JAX training code on GitHub, read our paper on arXiv, or watch our spotlight presentation at ICML 2022. For more music created with Perceiver AR, check out the Google Magenta blog post.