Large Language Models (LLMs) have become immensely popular in the AI community. They excel at text summarization, question answering, and content generation. However, these models are typically trained on noisy, unstructured web data, which makes pre-training slow and data-hungry.
Apple and Carnegie Mellon researchers introduced Web Rephrase Augmented Pre-training (WRAP) to address this. WRAP uses an instruction-tuned LLM to paraphrase web pages in different styles. This improves pre-training efficiency and model performance, while preserving the quality of the original information.
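The core recipe can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: `call_llm` is a placeholder for whatever instruction-tuned model does the rephrasing, and the style prompts are paraphrased from the kinds of styles the paper describes (e.g., Wikipedia-like prose and question-answer format). Note how the mixed corpus keeps some raw web text alongside the synthetic rephrasings.

```python
# Minimal sketch of the WRAP idea: rephrase raw web documents in several
# styles with an instruction-tuned LLM, then mix real and synthetic text
# into one pre-training corpus.

# Hypothetical style prompts, loosely modeled on the styles the paper describes.
STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in high-quality English like Wikipedia:",
    "qa": "Convert the following text into a question-answer format:",
}


def call_llm(prompt: str, document: str) -> str:
    """Placeholder: in practice, this would call an instruction-tuned LLM."""
    return f"[{prompt}] {document}"


def wrap_corpus(documents, styles=("wikipedia", "qa"), real_fraction=0.5):
    """Build a mixed corpus of real and rephrased documents."""
    synthetic = [
        call_llm(STYLE_PROMPTS[style], doc)
        for doc in documents
        for style in styles
    ]
    # Keep a share of raw web text so the model still sees real-world noise.
    n_real = int(len(documents) * real_fraction)
    return documents[:n_real] + synthetic


corpus = wrap_corpus(
    ["the cat sat on the mat.", "llms are neural networks."]
)
print(len(corpus))  # 1 real document + 4 synthetic rephrasings
```

The exact mixing ratio and style set are tunable knobs; the key design choice is that synthetic text supplements rather than replaces the original web data, so the quality of the underlying information is preserved.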
WRAP improves efficiency, speeding up pre-training by roughly three times. Model performance also gets a boost: perplexity drops significantly and question-answering accuracy improves across a range of tasks. In addition, because the rephrased documents use cleaner, more structured language than raw web text, WRAP better prepares LLMs for real-world scenarios. In short, WRAP not only expedites LLM training but also enhances the models' overall performance.
With WRAP, Apple and Carnegie Mellon are redefining LLM pre-training. By training on high-quality synthetic rephrasings alongside raw web text, the method makes LLM training more efficient and effective. The team's work is a significant step forward in compute- and data-efficient language modeling.
The paper, "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling", details the Apple and Carnegie Mellon research.