Distilling Step-by-Step: Training Smaller Models with Less Data to Outperform LLMs

How to Train Smaller AI Models with Less Data: A Breakthrough in Natural Language Processing


Large language models (LLMs) have revolutionized the field of AI, allowing us to solve new tasks with minimal training data. However, deploying these models in real-world applications can be challenging due to their immense size and computational requirements. To overcome these challenges, researchers have turned to smaller specialized models trained through fine-tuning or distillation. Unfortunately, these approaches typically need large amounts of training data to reach comparable performance, which makes them costly. In this article, we introduce a new approach called distilling step-by-step, which allows us to train smaller models with significantly less data while still achieving high performance.

What is Distilling Step-by-Step?

Distilling step-by-step is a simple yet powerful mechanism that extracts informative natural language rationales from LLMs. These rationales act as intermediate reasoning steps and provide valuable task knowledge. For example, when solving a math problem, an LLM can generate rationales like “Area = length * width” to better explain the solution process. We then use these rationales to train smaller task-specific models, enhancing their performance without the need for extensive training data.

How Does Distilling Step-by-Step Work?

1. Rationale Extraction: The first stage uses chain-of-thought (CoT) prompting to extract rationales from LLMs. Given a few examples with rationales in the prompt, the LLM relies on in-context learning to output a corresponding rationale for each new input.

2. Multi-Task Learning: In the second stage, we train the small models with a rationale generation task alongside the standard label prediction task. This enables the models to learn the reasoning steps and make better predictions. By using task prefixes in the input examples, the models can differentiate between the two tasks.
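
The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Q:`/`A:` prompt layout, the `[label]` and `[rationale]` prefix strings, the helper names, and the `rationale_weight` parameter are all hypothetical choices made for this example, and the call to the LLM itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One few-shot demonstration: a question, its rationale, and its answer."""
    question: str
    rationale: str
    answer: str


def build_cot_prompt(demos, new_question):
    """Stage 1: assemble a few-shot chain-of-thought prompt (assumed layout)."""
    parts = []
    for d in demos:
        parts.append(f"Q: {d.question}\nA: {d.rationale} So the answer is {d.answer}.")
    # The new input ends with an open "A:" so the LLM continues with a
    # rationale followed by an answer.
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)


def make_multitask_examples(input_text, label, rationale):
    """Stage 2: turn one annotated input into two prefixed (source, target) pairs."""
    return [
        ("[label] " + input_text, label),          # standard label prediction
        ("[rationale] " + input_text, rationale),  # rationale generation
    ]


def combined_loss(label_loss, rationale_loss, rationale_weight=1.0):
    """Weighted sum of the two task losses; the weight is an assumed hyperparameter."""
    return label_loss + rationale_weight * rationale_loss


demos = [
    Demo(
        question="A rectangle is 3 m long and 2 m wide. What is its area?",
        rationale="Area = length * width = 3 * 2 = 6.",
        answer="6",
    )
]
new_question = "A rectangle is 5 m long and 4 m wide. What is its area?"

prompt = build_cot_prompt(demos, new_question)
pairs = make_multitask_examples(
    new_question, label="20", rationale="Area = length * width = 5 * 4 = 20."
)

print(prompt)
for source, target in pairs:
    print(source, "->", target)
```

In this sketch the same input appears twice in the training set, once per prefix, so a single small model learns both tasks. At inference time only the label prefix would be needed, which means the rationale task adds no cost after training.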

Experimental Results

We conducted experiments using a 540B-parameter PaLM model as the LLM and T5 models as the task-specific models. The results showed that distilling step-by-step outperforms standard fine-tuning while using much less training data. For example, on the e-SNLI dataset, we achieved better performance than standard fine-tuning with only 12.5% of the full dataset. Distilling step-by-step also needs a much smaller model than few-shot prompted LLMs. On the ANLI dataset, a 770M-parameter T5 model, which is over 700 times smaller than the 540B PaLM model, achieved better performance than PaLM.
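
The size and data figures quoted above can be checked with quick arithmetic. The snippet below is only a sanity check on the numbers stated in this article; the variable names are illustrative.

```python
# Parameter counts as stated in the article.
palm_params = 540e9  # 540B PaLM
t5_params = 770e6    # 770M T5

size_ratio = palm_params / t5_params
print(f"PaLM is about {size_ratio:.0f}x larger than the T5 model")  # about 701x

# 12.5% of the training data corresponds to one eighth of the full dataset.
data_fraction = 0.125
print(f"Full dataset is {1 / data_fraction:.0f}x the subset used")  # 8x
```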


Distilling step-by-step offers a novel approach to training smaller AI models with less data. By extracting rationales from large language models and incorporating them into the training process, we can achieve high performance while reducing the computational requirements and training data needed. This breakthrough has the potential to revolutionize natural language processing and make AI more accessible for various applications.
