How to Increase the Efficiency of Transformer Networks
Google researchers Xin Wang and Nishanth Dikkala recently introduced AltUp (Alternating Updates), a method that captures the benefit of a wider token representation without increasing the computation cost. The method is easy to implement, widely applicable, and requires minimal hyperparameter tuning.
Understanding Transformers
Transformers first divide the input into a sequence of tokens and then map each token to a token embedding. The transformer then operates on these token embeddings using computation modules called layers. To achieve the benefits of scale without increasing the compute burden, prior works have predominantly focused on efficiently scaling up the network parameters by conditionally activating only a subset of them based on the input. Recent works have also established that a wider token representation helps in learning more complicated functions, but widening the representation vector requires increasing the model dimension, which quadratically increases the cost of the feedforward layers.
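To see why, note that a standard feedforward block multiplies each token by two weight matrices of shapes (d_model, 4·d_model) and (4·d_model, d_model), so its cost grows with the square of d_model. A quick back-of-the-envelope check in Python (the 4× expansion factor is the common transformer convention, assumed here for illustration):

```python
def ffn_macs_per_token(d_model, expansion=4):
    """Approximate multiply-accumulates per token in a feedforward
    block: two matmuls of shapes (d, 4d) and (4d, d)."""
    return 2 * d_model * (expansion * d_model)

for d in (512, 1024, 2048):
    print(d, ffn_macs_per_token(d))
# Each doubling of d_model quadruples the cost:
# 512 -> 2,097,152   1024 -> 8,388,608   2048 -> 33,554,432
```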
Introducing AltUp
AltUp works by partitioning a widened representation vector into equal-sized blocks, processing only a single block at each layer, and using an efficient prediction-correction mechanism to infer the outputs of the remaining blocks. This allows AltUp to keep the model dimension, and hence the per-layer cost, constant while still exploiting an increased token dimension. The predictor and corrector involve only a small number of vector additions and multiplications, so they incur negligible cost compared to the transformer layer itself.
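A minimal sketch of how such a prediction-correction step could look, assuming learned scalar mixing weights p for the predictor and scalar gains g for the corrector (the names and the exact update rule are illustrative, not the authors' reference implementation):

```python
import torch

def altup_step(blocks, layer_fn, p, g, selected):
    """One AltUp-style layer over K blocks of a widened representation.

    blocks:   list of K tensors, each (seq_len, d_model); together they
              form the widened (K * d_model) token representation
    layer_fn: the ordinary transformer layer; it only ever sees d_model
    p:        (K, K) predictor mixing weights (assumed learned scalars)
    g:        (K,) corrector gains (assumed learned scalars)
    selected: index of the block the expensive layer processes here
    """
    K = len(blocks)
    # Predict: guess each block's output as a cheap linear mixture of
    # all current blocks (vector additions and multiplications only).
    predicted = [sum(p[i, j] * blocks[j] for j in range(K)) for i in range(K)]
    # Compute: the transformer layer runs on a single d_model-wide
    # block, so per-layer cost matches the narrow model.
    computed = layer_fn(predicted[selected])
    # Correct: nudge every predicted block toward the computed result.
    error = computed - predicted[selected]
    return [predicted[i] + g[i] * error for i in range(K)]

# Toy usage: K=2 blocks, alternating the processed block across layers.
seq_len, d_model, K = 8, 16, 2
blocks = [torch.randn(seq_len, d_model) for _ in range(K)]
p = torch.eye(K) + 0.1 * torch.randn(K, K)    # stand-in for learned weights
g = torch.ones(K)                             # stand-in for learned gains
layer_fn = torch.nn.Linear(d_model, d_model)  # stand-in for a real layer
for depth in range(4):
    blocks = altup_step(blocks, layer_fn, p, g, selected=depth % K)
```

Because the predictor and corrector are just scalar-weighted sums, their cost is linear in the representation width, while the quadratic-cost layer still operates at the original d_model.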
Impressive Results
When evaluated with T5 models on a range of benchmark language tasks, models augmented with AltUp are uniformly faster than the extrapolated dense models at the same accuracy. For example, a T5 Large model augmented with AltUp achieves 27%, 39%, 87%, and 29% speedups on the GLUE, SuperGLUE, SQuAD, and TriviaQA benchmarks, respectively.
Overall, AltUp consistently leads to models with better predictive performance than baseline models running at the same speed, across all evaluated model sizes and benchmarks.
When applied to larger models, AltUp also delivered improved performance, demonstrating that the approach scales.
In conclusion, AltUp is a groundbreaking method for making transformer networks more efficient without increasing the compute burden, and its impressive results make it a promising tool for future natural language processing and other AI applications.