Open-Source Language Models
Many open-source projects have released large language models (LLMs) fine-tuned to follow instructions. These models can give helpful responses to user questions and commands. Notable examples include Alpaca and Vicuna, which are based on LLaMA, and OpenAssistant and Dolly, which are based on Pythia.
The Struggle to Benchmark LLMs
Despite the constant release of new models, the community still struggles to benchmark them effectively. It is difficult to build a system that automatically evaluates the quality of an LLM assistant's answers, because the questions posed to it are often open-ended. Human evaluation through pairwise comparison is often necessary. An ideal benchmark based on pairwise comparison would be scalable (it should handle many models even when data cannot be collected for every possible pair), incremental (it should be able to rate a new model from a relatively small number of trials), and able to produce a unique ordering of all models.
Shortcomings of Current Benchmarking Systems
Most current LLM benchmarking systems do not meet all of these requirements. Traditional benchmark frameworks such as HELM and lm-evaluation-harness offer multi-metric measures for research-standard tasks, but they do not evaluate free-form questions well because they are not based on pairwise comparison.
Introducing Chatbot Arena
LMSYS ORG, an organization known for developing large, open, scalable, and accessible models and systems, has launched Chatbot Arena, a crowdsourced LLM benchmark platform built on anonymous, randomized battles. Chatbot Arena ranks models with the Elo rating system, the same system used in chess and other competitive games.
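To make the Elo idea concrete, here is a minimal sketch of the standard Elo update applied to a single battle between two models. The K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not values stated in this article, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k (the K-factor) is an assumed value; it controls how far one result moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level at an assumed rating of 1000; model A wins the vote.
print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Because the update depends only on the two current ratings and the outcome, a new model can be added at any time and rated from its own battles, without replaying every earlier comparison.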
How Chatbot Arena Works
The arena opened a week ago with several well-known open-source LLMs. Because the data is crowdsourced, it reflects some of the ways LLMs are actually used in the real world. Users compare two anonymous models by chatting with them side by side in the arena.
The arena is served with FastChat, a multi-model serving system, and can be accessed at https://arena.lmsys.org. A user who enters the arena starts a conversation with two anonymous models. After receiving a response from each model, the user can continue the conversation or vote for the model they prefer. Once a vote is cast, the models' identities are revealed. The user can then keep chatting with the same models or start a new battle with a different pair. All user activity is recorded, and the analysis uses only votes cast while the model names were hidden. Since its launch a week ago, the arena has received around 7,000 valid, anonymous votes.
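The battle flow described above can be sketched in a few lines. This is a hypothetical illustration, not FastChat's actual code: the `Battle` class, the `start_battle` helper, and the lowercase model identifiers are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Battle:
    """One anonymous battle; identities stay hidden until a vote is cast."""
    model_a: str
    model_b: str
    winner: Optional[str] = None  # "model_a", "model_b", or "tie"
    revealed: bool = False

    def vote(self, choice: str) -> Tuple[str, str]:
        """Record a vote, then reveal and return the two model names."""
        if self.revealed:
            # A later vote would no longer be anonymous, so it is rejected.
            raise ValueError("identities already revealed; vote not counted")
        if choice not in ("model_a", "model_b", "tie"):
            raise ValueError(f"unknown choice: {choice!r}")
        self.winner = choice
        self.revealed = True
        return self.model_a, self.model_b


def start_battle(models: Sequence[str], rng: random.Random) -> Battle:
    """Pick two distinct models at random, in random left/right order."""
    a, b = rng.sample(list(models), 2)
    return Battle(model_a=a, model_b=b)


rng = random.Random(0)  # seeded only so the example is repeatable
battle = start_battle(["vicuna", "alpaca", "openassistant", "dolly"], rng)
names = battle.vote("model_a")  # the user prefers the left-hand model
print(names, battle.winner)
```

Rejecting votes after the reveal mirrors the rule that only votes cast while the names are hidden count toward the rankings.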
Future Improvements to Chatbot Arena
In the future, the developers of Chatbot Arena plan to implement improved sampling algorithms, tournament procedures, and serving systems. These enhancements will allow for a wider range of models and provide more detailed rankings for different tasks.
In conclusion, LMSYS ORG's Chatbot Arena offers a distinctive benchmarking platform for LLMs. With its anonymous, randomized battles and its Elo rating system, it shows promise for evaluating the quality of LLM assistants.