The Significance of Extending Context Length in Language Models
When it comes to language models, the ability to handle longer contexts is essential. However, it has been unclear whether a model pre-trained on a short context can be extended to handle longer ones. To answer this question, researchers at Abacus AI ran a series of experiments with different techniques for extending the context length of Llama, a model pre-trained with a context length of 2048 tokens. By linearly rescaling the model's positions and applying instruction fine-tuning (IFT) at scales 4 and 16, they obtained models that can perform tasks at context lengths of up to 16k, or even 20-24k.
Methods of Extending Context Length
There are several methods that can be used to extend the context length of language models. These include linear scaling of positions, scaling the Fourier basis of Rotary Position Embedding (RoPE) by a power, truncating the Fourier basis, and randomizing the position vector. In their experiments, the Abacus AI researchers fine-tuned with these methods using the RedPajama and Vicuna datasets. They found that linear scaling was effective at increasing context length, while truncation and randomization achieved good perplexity scores but performed less well on retrieval tasks.
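To make the linear scaling idea concrete, here is a minimal sketch in plain Python. It is not the Abacus AI implementation: the function names, the base of 10000, and the choice to rotate adjacent pairs of dimensions are assumptions made for illustration (real implementations differ in how they pair dimensions). The point it shows is that dividing the position by the scale factor maps a long context back into the position range seen during pre-training.

```python
import math


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Return the RoPE rotation angles for one position.

    With scale > 1 the position index is divided by `scale` before the
    angles are computed (linear scaling), so position 8192 at scale 4
    receives the same angles as position 2048 in the unscaled model.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    scaled_position = position / scale
    # One frequency per pair of dimensions: base^(-2i/dim).
    return [scaled_position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vector, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vector` by the RoPE angles."""
    angles = rope_angles(position, len(vector), base, scale)
    rotated = []
    for i, theta in enumerate(angles):
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated
```

Under these assumptions, a scale of 4 stretches a 2048-token pre-training window to 8192 positions and a scale of 16 stretches it to 32768, at the cost of placing neighbouring tokens closer together in angle space, which is why fine-tuning is still needed.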
Evaluating the Models
To evaluate the extended context length models, the researchers used datasets from LMSys, open-book question-answering datasets, and WikiQA. The LMSys datasets were used for locating substrings in the context, while the WikiQA task involved answering questions based on information in a Wikipedia document.
To further evaluate the models, the team created a QA task based on the short-answer format data in Google Natural Questions. By placing the answer at different locations within the extended context, they were able to evaluate every region of the context window rather than only its beginning or end. Additionally, they created versions of the same Wikipedia document at varying lengths so that models with different context sizes could be compared fairly.
Addressing Pre-Trained Text Bias
One issue with using a Wikipedia-based dataset is that the model tends to answer from text it saw during pre-training rather than from the context. To address this bias, the researchers created an altered dataset consisting of questions with only numerical answers. They replaced the answer, and every occurrence of it in the document, with a different number. This ensured that the model would give an incorrect answer if it relied solely on what it had memorized during pre-training. The original QA task was referred to as Free Form QA (FFQA), while the altered task was named Altered Numerical QA (AltQA).
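The alteration step can be sketched roughly as follows. This is an illustrative assumption about how such a dataset might be built, not the researchers' actual script: the function name `alter_numeric_example`, the rule of keeping the same number of digits, and the handling of plain integer answers only are all hypothetical choices.

```python
import random
import re


def alter_numeric_example(document, answer, rng=None):
    """Swap a numeric answer for a different number throughout a document.

    Returns the altered document and the new answer. Only plain ASCII
    integer answers are handled in this sketch.
    """
    rng = rng or random.Random(0)
    if not (answer.isascii() and answer.isdigit()):
        raise ValueError("this sketch only handles plain integer answers")
    # Draw a replacement with the same number of digits as the original.
    low = 10 ** (len(answer) - 1) if len(answer) > 1 else 0
    high = 10 ** len(answer) - 1
    new_answer = answer
    while new_answer == answer:
        new_answer = str(rng.randint(low, high))
    # Match the answer only as a standalone number, not inside a longer one.
    pattern = re.compile(r"(?<!\d)" + re.escape(answer) + r"(?!\d)")
    return pattern.sub(new_answer, document), new_answer
```

For example, calling `alter_numeric_example("Opened in 1932. By 1932, 11932 cars.", "1932")` replaces both standalone occurrences of 1932 and leaves 11932 untouched. A model that answers "1932" from memory is then scored as wrong, because the context now states a different number.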
The researchers measured Presence Accuracy on every example in both versions of the QA task. They found that IFT alone improved accuracy but did not extend the range of context lengths the model could handle. However, IFT combined with scaled context produced a significant improvement in performance: the researchers observed a 2x improvement in FFQA and a 2.5x improvement in AltQA at all positions interpolated by the context scale factor. Ultimately, these findings suggest that larger-context language models can capture the theme of a document more effectively.
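The article does not spell out how Presence Accuracy is computed. A reasonable reading, assumed here, is that an example counts as correct when the reference answer appears as a substring of the model's generated text. The sketch below implements that assumed definition; the function name and the case-insensitive comparison are illustrative choices.

```python
def presence_accuracy(generations, answers):
    """Fraction of examples whose reference answer appears in the generation.

    Assumed definition: a case-insensitive substring match between the
    reference answer and the generated text.
    """
    if len(generations) != len(answers):
        raise ValueError("generations and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(
        answer.strip().lower() in generation.lower()
        for generation, answer in zip(generations, answers)
    )
    return hits / len(answers)
```

Computing this score separately for each answer position gives a per-position accuracy curve, which is the kind of comparison the 2x and 2.5x figures above refer to.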