Large Language Models (LLMs) have gained significant attention for their impressive performance across a wide range of tasks. They can generate original content, answer questions, translate languages, and summarize text, showcasing their ability to imitate human language use. Well-known LLMs such as GPT, BERT, and PaLM have drawn attention for their accurate results, enabled by training on vast amounts of high-quality data. However, models like GPT-4 and PaLM are not open-source, making it difficult for researchers to study their architectures and training data.
On the other hand, open-source LLMs like Pythia, LLaMA, and Flan-T5 offer researchers the opportunity to improve and customize the models based on their own instruction datasets. This has led to the development of smaller and more efficient LLMs such as Alpaca, Vicuna, OpenAssistant, and MPT.
No single open-source LLM dominates across the board; different LLMs excel in different scenarios, so it is natural to combine them dynamically to improve output quality. To this end, researchers from the Allen Institute for Artificial Intelligence, the University of Southern California, and Zhejiang University propose LLM-BLENDER, an ensembling framework that integrates the strengths of multiple LLMs while reducing their biases, errors, and uncertainties, and consistently achieves superior performance.
LLM-BLENDER consists of two modules: PAIRRANKER and GENFUSER. PAIRRANKER identifies subtle differences among candidate outputs through pairwise comparison. It uses a cross-attention encoder such as RoBERTa to jointly encode the input text together with a pair of candidate outputs from different LLMs, and from this joint encoding predicts which candidate in the pair is better.
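The selection logic can be sketched as follows. This is a minimal illustration, not the actual PAIRRANKER implementation: the real module scores each pair with a RoBERTa-based cross-encoder, whereas here `pairwise_score` is a hypothetical stand-in (a toy word-overlap heuristic) so that the aggregation of pairwise wins into a ranking stays runnable.

```python
from itertools import combinations

def pairwise_score(source: str, cand_a: str, cand_b: str) -> float:
    """Hypothetical stand-in for PAIRRANKER's cross-encoder head.

    The real module jointly encodes (source, cand_a, cand_b) with a
    RoBERTa-style encoder and predicts which candidate is better.
    Here, a toy heuristic (word overlap with the source) is used so
    the example runs without model weights.
    """
    source_words = set(source.lower().split())
    overlap_a = len(source_words & set(cand_a.lower().split()))
    overlap_b = len(source_words & set(cand_b.lower().split()))
    return overlap_a - overlap_b  # > 0 means cand_a is judged better

def rank_candidates(source: str, candidates: list[str]) -> list[str]:
    """Aggregate all pairwise comparisons into a ranking (most wins first)."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        if pairwise_score(source, a, b) > 0:
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)
```

In the full framework, the comparison over all N candidates costs O(N²) encoder calls per input; the ranking it produces feeds the fusion stage described next.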
GENFUSER, the second module, focuses on merging the top-ranked candidates to generate an improved final output. It maximizes the advantages of the selected candidates while minimizing their disadvantages. By merging the outputs of multiple LLMs, GENFUSER creates an output that surpasses the performance of any single LLM.
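A minimal sketch of how the fusion input might be assembled: the top-ranked candidates are concatenated with the instruction and handed to a seq2seq fusion model. The function name, prompt template, and separator token below are illustrative assumptions, not the format used by the released LLM-BLENDER checkpoints.

```python
def build_fusion_input(instruction: str, top_candidates: list[str], k: int = 3) -> str:
    """Concatenate the instruction with the top-k ranked candidates.

    The joined string is the input to a seq2seq fusion model, which
    generates a single response combining the candidates' strengths.
    The "Candidate i:" labels and separator are illustrative only.
    """
    parts = [f"Instruction: {instruction}"]
    for i, cand in enumerate(top_candidates[:k], start=1):
        parts.append(f"Candidate {i}: {cand}")
    return " </s> ".join(parts)
```

The resulting string would then be tokenized and passed to the fusion model's `generate` method, yielding one fused response rather than a verbatim copy of any single candidate.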
To evaluate the effectiveness of LLM-BLENDER, the research team provides a benchmark dataset called MixInstruct, which combines various instruction datasets and incorporates oracle pairwise comparisons. It includes training, validation, and test examples, with oracle comparisons on the evaluation splits enabling automatic evaluation: the ground-truth rankings they provide are used to assess LLM-BLENDER and the baseline techniques.
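The evaluation idea can be sketched with toy data. The data format below (a list of `(a, b, winner)` triples and a simple wins-count ranking) is a simplification for illustration, not MixInstruct's actual schema; it shows how oracle pairwise comparisons yield a ground-truth ranking against which a ranker's top pick can be scored.

```python
def oracle_ranking(candidates: list[str], comparisons: list[tuple]) -> list[str]:
    """Turn oracle pairwise comparisons into a ground-truth ranking.

    `comparisons` is a list of (a, b, winner) triples; candidates are
    ranked by total pairwise wins. This data format is simplified for
    illustration and does not mirror MixInstruct's actual schema.
    """
    wins = {c: 0 for c in candidates}
    for _a, _b, winner in comparisons:
        wins[winner] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

def top1_accuracy(selections: list[str], gold_rankings: list[list[str]]) -> float:
    """Fraction of examples where the ranker's pick matches the oracle's best."""
    hits = sum(sel == gold[0] for sel, gold in zip(selections, gold_rankings))
    return hits / len(selections)
```

With ground-truth rankings available per example, metrics like this top-1 agreement can be computed automatically, without further human or GPT-based judging.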
Experimental results show that LLM-BLENDER outperforms individual LLMs and baseline techniques across evaluation metrics, establishing a significant performance gap. PAIRRANKER's selections outperform any individual LLM on both reference-based metrics and GPT-Rank, and by fusing PAIRRANKER's top picks, GENFUSER further improves the quality of the final responses.
LLM-BLENDER has also showcased its potential by outperforming individual LLMs like Vicuna. This ensembling methodology holds great promise for enhancing LLM deployment and research through ensemble learning.
For more details on LLM-BLENDER, refer to the paper, the project website, and the GitHub repository.