The Significance of Optimal Training Setup for Large Language Models
In recent years, the dominant approach to improving language models has been to increase the number of parameters in transformer-based models, which has led to impressive results across many natural language processing tasks. DeepMind showcased a 280-billion-parameter model called Gopher, which achieved leading performance in language modeling and question answering. An even larger model, Megatron-Turing NLG, with 530 billion parameters, has since been published.
Training these large models is expensive, so it is important to choose the best training setup and avoid wasting resources. The training compute cost of a transformer depends on two quantities: the model size and the number of training tokens. In this study, the researchers investigate how best to trade off model size against the amount of training data as the available compute budget grows.
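The tradeoff can be sketched numerically. A widely used approximation (not stated in this article, so treat it as an assumption) is that training a dense transformer with N parameters on D tokens costs about C = 6·N·D floating-point operations, which makes the size-versus-data exchange explicit:

```python
# Rough sketch, assuming the common approximation that training a dense
# transformer with N parameters on D tokens costs about 6*N*D FLOPs.
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * n_params * n_tokens

# For a fixed compute budget, halving the model size lets you
# double the number of training tokens (and vice versa):
budget  = train_flops(280e9, 300e9)   # a Gopher-sized run as a reference point
smaller = train_flops(140e9, 600e9)   # half the parameters, twice the tokens
assert budget == smaller
```

Under this approximation, the study's question becomes: for a fixed product N·D, what split between N and D gives the lowest loss?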
The findings suggest that current large language models are too large for their compute budgets and are trained on too little data: for a fixed budget, a smaller model trained on more data would have been preferable. The researchers tested this hypothesis by training a 70-billion-parameter model called Chinchilla with roughly the same compute budget as Gopher but on about four times as much data, and it outperformed larger models such as Gopher on many tasks.
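The rough compute parity between the two runs can be checked with the same 6·N·D approximation (an assumption carried over from above; the token counts used here are the published figures of roughly 300 billion for Gopher and 1.4 trillion for Chinchilla):

```python
# Rough comparison, assuming training compute C ~ 6*N*D FLOPs.
gopher     = 6 * 280e9 * 300e9    # 280B parameters, ~300B tokens
chinchilla = 6 * 70e9  * 1.4e12   # 70B parameters, ~1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")

# The two budgets are within ~20% of each other: Chinchilla spends
# roughly the same compute on a 4x smaller model and ~4.7x more data.
assert abs(gopher - chinchilla) / gopher < 0.2
```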
After the release of Chinchilla, the 540-billion-parameter PaLM model was released, trained on 780 billion tokens. Despite not being compute-optimal, it outperformed Chinchilla on a range of tasks. Applying the Chinchilla scaling laws to PaLM's compute budget suggests that a roughly 140-billion-parameter model trained on 3 trillion tokens would have been compute-optimal for that budget, while also being more efficient for inference.
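The 140B/3T figure can be recovered from two assumptions: the 6·N·D cost approximation used above, and the frequently cited Chinchilla rule of thumb that a compute-optimal model sees roughly 20 training tokens per parameter (D ≈ 20·N); neither is stated in this article, so this is an illustrative back-of-the-envelope check rather than the authors' calculation:

```python
import math

# Sketch, assuming C ~ 6*N*D FLOPs and the Chinchilla rule of thumb
# that compute-optimal training uses D ~ 20*N tokens.
palm_budget = 6 * 540e9 * 780e9            # PaLM: 540B params, 780B tokens

# Substitute D = 20*N into C = 6*N*D and solve 120*N^2 = C for N.
n_opt = math.sqrt(palm_budget / 120)
d_opt = 20 * n_opt

print(f"Optimal params: {n_opt:.3e}")      # ~1.45e11, i.e. ~145B parameters
print(f"Optimal tokens: {d_opt:.3e}")      # ~2.9e12, i.e. ~2.9T tokens
```

Both values land close to the 140-billion-parameter, 3-trillion-token figure quoted above.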
Smaller, more performant models also benefit from reduced inference time and memory costs, making queries faster and feasible on less hardware. In practice, Chinchilla is substantially cheaper to use, in addition to performing better. Further simple optimizations of this kind may continue to provide large gains.
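The inference saving can be made concrete with one more common approximation (again an assumption, not from this article): generating a token with a dense transformer costs about 2·N FLOPs, so per-token cost scales directly with parameter count:

```python
# Sketch, assuming inference on a dense transformer costs ~2*N FLOPs
# per generated token, so cost scales linearly with model size.
gopher_per_token     = 2 * 280e9   # Gopher: 280B parameters
chinchilla_per_token = 2 * 70e9    # Chinchilla: 70B parameters

# Chinchilla is ~4x cheaper per generated token, on top of its
# better benchmark performance.
assert gopher_per_token / chinchilla_per_token == 4.0
```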