Revolutionizing Natural Language Processing with Smaller Models: The Distilling Step-by-Step Approach
In recent years, artificial intelligence has made significant advancements in natural language processing. Large language models (LLMs) have played a crucial role in enabling zero-shot and few-shot learning capabilities. However, their deployment in real-world applications has been limited by their high computational demands. For example, serving a single 175 billion parameter LLM requires at least 350GB of GPU memory and specialized infrastructure. With newer models exceeding 500 billion parameters, these requirements put LLMs out of reach for many teams, especially for applications that require low latency.
Addressing the Deployment Challenge with Specialized Models
To tackle this deployment challenge, researchers have explored smaller specialized models trained through fine-tuning or distillation techniques. Fine-tuning relies on costly and time-consuming human-generated labels, while distillation requires large amounts of unlabeled data, which can be hard to obtain.
A research team from Google and the University of Washington presented a study at ACL 2023 introducing a mechanism called "Distilling Step-by-Step" to mitigate the trade-off between model size and data collection cost. The approach extracts informative natural language rationales, or intermediate reasoning steps, from LLMs. These rationales then serve as additional supervision for training smaller task-specific models alongside the standard task labels.
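To make the rationale-extraction idea concrete, here is a minimal sketch of assembling a few-shot prompt whose worked examples pair each answer with a reasoning step, so the LLM imitates the pattern on a new input. The demonstration text and formatting below are illustrative assumptions, not the paper's exact prompt.

```python
# Illustrative few-shot demonstrations: each pairs an input with a
# rationale and a label. The content here is a made-up example, not
# taken from the paper's prompts.
COT_DEMONSTRATIONS = [
    {
        "question": "A coin is heads up. Ada flips the coin. Is it heads up?",
        "rationale": "Flipping reverses the coin, so heads becomes tails.",
        "answer": "no",
    },
]

def build_cot_prompt(demonstrations, new_input):
    """Concatenate worked (input, rationale, label) examples before the
    new input, so the LLM continues the pattern and emits a rationale
    followed by an answer."""
    parts = []
    for d in demonstrations:
        parts.append(
            f"Q: {d['question']}\n"
            f"A: {d['rationale']} The answer is {d['answer']}.\n"
        )
    parts.append(f"Q: {new_input}\nA:")
    return "\n".join(parts)

prompt = build_cot_prompt(
    COT_DEMONSTRATIONS,
    "A coin is tails up. Bo does not flip it. Is it tails up?",
)
```

The prompt ends at "A:", leaving the LLM to complete both the rationale and the answer; the generated rationale is what later supervises the small model.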
The Distilling Step-by-Step Process
The researchers propose a two-stage process for implementing Distilling Step-by-Step. First, they use chain-of-thought (CoT) prompting to extract rationales from an LLM, which allows the model to generate rationales for unseen inputs. Then, these rationales are integrated into the training of small models through a multi-task learning framework, with task prefixes guiding the model to differentiate between label prediction and rationale generation.
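The multi-task setup can be sketched as follows: each annotated example is duplicated with a task prefix, so a single sequence-to-sequence model learns both to predict the label and to generate the LLM-extracted rationale. The exact prefix strings and the loss-weight name below are assumptions for illustration; the paper's general scheme is task prefixes plus a weighted sum of the two losses.

```python
# Sketch of the multi-task data layout used to train the small model.
# "[label]" / "[rationale]" are hypothetical prefix tokens; the paper
# uses task prefixes, but these exact strings are illustrative.

def make_multitask_examples(text, label, rationale):
    """Turn one annotated example into two (source, target) training
    pairs: one for label prediction, one for rationale generation."""
    return [
        ("[label] " + text, label),          # predict the task label
        ("[rationale] " + text, rationale),  # reproduce the rationale
    ]

def combined_loss(label_loss, rationale_loss, lam=1.0):
    """Total training objective: label loss plus a weighted rationale
    loss. `lam` is a hypothetical weighting hyperparameter."""
    return label_loss + lam * rationale_loss

pairs = make_multitask_examples(
    "Premise: A dog runs. Hypothesis: An animal moves.",
    "entailment",
    "A dog is an animal and running is a form of moving.",
)
```

At inference time only the label-prediction prefix is used, so the small model pays no extra cost for having learned to generate rationales during training.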
Impressive Performance Gains with Reduced Data Requirements
In a series of experiments, the researchers used a 540 billion parameter PaLM model as the LLM and T5 models as the smaller task-specific models. Distilling Step-by-Step demonstrated remarkable performance gains with significantly reduced data requirements. For example, on the e-SNLI dataset, the method outperformed standard fine-tuning while using only 12.5% of the full dataset. Similar reductions in dataset size were observed across other NLP benchmarks, including ANLI, CQA, and SVAMP.
Furthermore, Distilling Step-by-Step achieved superior performance with considerably smaller models than few-shot CoT-prompted LLMs. For example, on the e-SNLI dataset, a 220 million parameter T5 model surpassed the performance of the 540 billion parameter PaLM. On ANLI, a 770 million parameter T5 model, over 700 times smaller, outperformed the 540 billion parameter PaLM, demonstrating the potential for efficiency gains.
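As a sanity check on the 700× figure, the parameter-count ratio between the two models works out as:

```python
# Parameter-count ratio between the 540B PaLM and the 770M T5.
palm_params = 540e9  # 540 billion parameters
t5_params = 770e6    # 770 million parameters

ratio = palm_params / t5_params
print(round(ratio))  # prints 701
```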
Importantly, Distilling Step-by-Step can outperform few-shot LLMs using both significantly smaller models and less data. For example, on ANLI, a 770 million parameter T5 model surpassed the 540 billion parameter PaLM using only 80% of the full dataset, whereas standard fine-tuning failed to match the PaLM's performance even with 100% of the data.
The Distilling Step-by-Step approach introduces a groundbreaking paradigm for training small, task-specific models. By extracting rationales from LLMs, this approach reduces the data required for model training and enables the use of significantly smaller models. This innovative technique has the potential to revolutionize the field of natural language processing, making advanced language models more accessible and practical for a broader range of applications.