The Significance of Training Stability in Transformers
Training stability is crucial for Transformers, a widely used neural network architecture. In this study, the focus is on understanding the training dynamics of Transformers by examining the evolution of the attention layers. Attention entropy, which measures the sharpness of the attention distributions (low entropy means each query concentrates its attention on only a few keys), is tracked for each attention head during training. A common pattern emerges across different architectures and tasks: low attention entropy is associated with high training instability, which can manifest as oscillating loss or divergence. This phenomenon, referred to as entropy collapse, poses a challenge in achieving reliable training outcomes.
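The quantity tracked here can be made concrete. The sketch below (not the authors' reference code) computes the Shannon entropy of each attention row and averages it per head; values near zero indicate collapsed, near-one-hot attention, while uniform attention over n keys gives the maximum entropy log(n).

```python
import math
import torch

def attention_entropy(attn):
    """Mean Shannon entropy per attention head.

    attn: tensor of shape (heads, queries, keys) holding attention
    probabilities (each row sums to 1). Returns a (heads,) tensor.
    """
    eps = 1e-12  # avoid log(0) on exactly-zero probabilities
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, queries)
    return ent.mean(dim=-1)                         # (heads,)

# Uniform attention over 8 keys attains the maximum entropy log(8).
uniform = torch.full((1, 4, 8), 1.0 / 8)
print(attention_entropy(uniform))  # ≈ log(8) ≈ 2.079
```

Monitoring this statistic over training steps is what reveals the collapse pattern described above.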
The Proposed Solution:
To address entropy collapse, we present sigmaReparam, a simple and efficient remedy. This solution involves reparametrizing all linear layers with spectral normalization and an additional learned scalar. By implementing this reparameterization, entropy collapse in the attention layers can be prevented, leading to more stable training. An important aspect of our approach is the establishment of a tight lower bound for attention entropy. This lower bound decreases exponentially with the spectral norm of the attention logits, providing further motivation for our proposed solution.
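The reparametrization described above can be sketched as follows. This is an illustrative implementation, not the authors' reference code: the weight is rescaled as W_hat = (gamma / sigma(W)) * W, where sigma(W) is the spectral norm estimated by power iteration and gamma is the learned scalar (initialized to 1 here, an assumption).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    """Sketch of a sigmaReparam-style linear layer: the effective weight
    is (gamma / sigma(W)) * W, with sigma(W) tracked by power iteration."""

    def __init__(self, in_features, out_features, n_iters=1):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) * in_features ** -0.5)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.gamma = nn.Parameter(torch.ones(()))  # learned scalar
        self.n_iters = n_iters
        # persistent estimate of the leading left singular vector
        self.register_buffer("u", torch.randn(out_features))

    def spectral_norm(self):
        u = self.u
        with torch.no_grad():  # power-iteration vectors carry no gradient
            for _ in range(self.n_iters):
                v = F.normalize(self.weight.t() @ u, dim=0)
                u = F.normalize(self.weight @ v, dim=0)
            self.u.copy_(u)
        # gradients still flow through self.weight in this estimate
        return u @ self.weight @ v

    def forward(self, x):
        w_hat = (self.gamma / self.spectral_norm()) * self.weight
        return F.linear(x, w_hat, self.bias)
```

Because the spectral norm of the effective weight is pinned to gamma, the scale of the attention logits (and hence, via the lower bound above, the attention entropy) is controlled by a single learned parameter per layer.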
We conducted experiments with sigmaReparam on a range of tasks, including image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling, using several different Transformer architectures. The results demonstrate that sigmaReparam improves stability and robustness, even without hyperparameter tuning. Notably, sigmaReparam enabled training a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers. Similarly, deep architectures for machine translation and speech recognition reached competitive performance without warmup or adaptive optimizers.
Figure 1 illustrates how sensitive Transformers are to hyperparameters: increasing the learning rate can easily trigger attention entropy collapse and training divergence.
Training stability is a critical factor in the success of Transformers. Through the analysis of attention layers and the introduction of sigmaReparam, we have demonstrated a way to prevent entropy collapse and promote more stable training. The experimental results across diverse tasks and architectures underscore the effectiveness and versatility of sigmaReparam. By addressing this challenge, we simplify the training of Transformers and ease their deployment in real-world applications.