Title: Optimizing AI Language Models: Finding the Sweet Spot
In recent years, research on language models has focused on improving performance by increasing the number of parameters in transformer-based models, an approach that has produced impressive results on natural language processing tasks. DeepMind pursued this direction with Gopher, a 280-billion-parameter model that achieved state-of-the-art performance across a range of tasks, and an even larger model, Megatron-Turing NLG, has since been published with 530 billion parameters.
Optimal Tradeoff Between Model Size and Training Data:
Training these large models is expensive, so it is crucial to choose the best training setup and avoid wasting resources. The compute cost is determined by two factors: model size and the number of training tokens. The current generation of large language models has scaled up mainly by adding parameters, while keeping the amount of training data roughly fixed at around 300 billion tokens.
To investigate the optimal tradeoff between model size, training tokens, and computational resources, we asked: “What is the optimal model size and number of training tokens for a given compute budget?” To answer this, we trained models spanning a wide range of sizes and token counts.
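As a rough illustration (an approximation we are assuming, not a figure stated above), training compute in FLOPs is often estimated as C ≈ 6 · N · D, where N is the parameter count and D is the number of training tokens. A minimal sketch of the budget calculation:

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D FLOPs.

    A common heuristic, not an exact figure for any specific model.
    """
    return 6 * params * tokens

# A Gopher-scale setup: 280B parameters, ~300B training tokens.
budget = train_flops(280e9, 300e9)
print(f"{budget:.2e}")  # ~5.04e+23 FLOPs
```

Under this heuristic, the same budget can be spent on a larger model with fewer tokens or a smaller model with more tokens; the question is which allocation performs best.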
Findings and Analysis:
We discovered that current large language models are excessively large for their compute budgets and are not trained on enough data. Surprisingly, we found that a model trained with the same FLOPs as Gopher should have been 4 times smaller and trained on 4 times more data to be compute-optimal.
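Under the common approximation that training compute scales as 6 · N · D (our assumption, not a formula given in the text), shrinking the model 4x while growing the data 4x leaves the budget essentially unchanged:

```python
def train_flops(params: float, tokens: float) -> float:
    # Heuristic training-compute estimate: C ~= 6 * N * D (an assumption).
    return 6 * params * tokens

gopher = train_flops(280e9, 300e9)            # Gopher's approximate budget
rescaled = train_flops(280e9 / 4, 300e9 * 4)  # 4x smaller model, 4x more data
print(rescaled / gopher)  # ~1.0: same compute, reallocated toward data
```

The factors cancel exactly, so the comparison is between two models of the same training cost, not between a cheap model and an expensive one.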
To test our data-scaling hypothesis, we trained Chinchilla, a 70-billion-parameter model, on 1.4 trillion tokens. Despite having a quarter of Gopher’s parameters, Chinchilla outperformed Gopher and other large language models on almost every task we measured.
Impact on Performance:
Our analysis confirmed that increasing model size doesn’t always lead to better performance: Chinchilla’s 70 billion parameters outperformed Gopher’s 280 billion. That said, total compute still matters: PaLM, a 540-billion-parameter model trained on 768 billion tokens with a larger compute budget, outperformed Chinchilla.
Optimal Model Size and Training Data:
Considering the PaLM compute budget, our methods predict that a 140-billion-parameter model trained on 3 trillion tokens would be optimal and more efficient for inference.
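We can sanity-check this prediction with the same 6 · N · D compute heuristic (an approximation we are assuming): 140 billion parameters on 3 trillion tokens lands on roughly the same budget as 540 billion parameters on 768 billion tokens.

```python
def train_flops(params: float, tokens: float) -> float:
    # Heuristic training-compute estimate: C ~= 6 * N * D (an assumption).
    return 6 * params * tokens

palm = train_flops(540e9, 768e9)     # PaLM: 540B params, 768B tokens
proposed = train_flops(140e9, 3e12)  # predicted optimum: 140B params, 3T tokens
print(proposed / palm)  # ~1.01: nearly identical compute budgets
```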
Advantages of Smaller Models:
Smaller, more efficient models like Chinchilla also reduce inference time and memory costs, making them faster to run and usable on less powerful hardware. Moreover, further downstream optimizations could provide additional gains.
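As a back-of-the-envelope illustration (the 2 · N forward-pass rule below is our assumption, not a figure from the text), the inference savings track the parameter-count reduction directly:

```python
def inference_flops_per_token(params: float) -> float:
    # Forward-pass cost per token ~= 2 * N FLOPs: a rough heuristic that
    # ignores attention and memory overheads (an assumption, not from the text).
    return 2 * params

chinchilla = inference_flops_per_token(70e9)
gopher = inference_flops_per_token(280e9)
print(gopher / chinchilla)  # 4.0: Chinchilla is ~4x cheaper per generated token
```

Because this per-token cost is paid on every query for the lifetime of a deployed model, a 4x reduction compounds into large serving savings.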
Our findings indicate that current large language models are oversized for their compute budgets and trained on too little data. Smaller models like Chinchilla can deliver superior performance while being more resource-efficient. By choosing the optimal model size and amount of training data, we can maximize the effectiveness of AI language models while minimizing wasted resources.