
Addressing computational inefficiency: vLLM – a faster and cheaper alternative for large language models


Large language models, also known as LLMs, are a revolutionary advancement in artificial intelligence (AI). Models like GPT-3 have changed how machines handle natural language: they can interpret vast amounts of data and generate human-like text, opening up new possibilities for human-machine interaction. However, LLMs face a significant challenge: they are computationally expensive and can be slow even on powerful hardware. Serving these models requires extensive computational resources, memory, and processing power, making them costly and often impractical for real-time applications. To address this challenge, researchers at the University of California, Berkeley, have developed vLLM, an open-source library that offers a simpler, faster, and cheaper alternative for LLM inference and serving.

By using vLLM as the backend for their Vicuna and Chatbot Arena services, the Large Model Systems Organization (LMSYS) has been able to efficiently handle peak traffic and reduce operational costs. The library supports popular Hugging Face models such as GPT-2, and the team reports throughput up to 24x higher than Hugging Face Transformers in its benchmarks.

One of the primary constraints on LLM performance is memory. To tackle this issue, the Berkeley researchers introduced PagedAttention, a novel attention algorithm that optimizes memory usage. PagedAttention stores attention key and value tensors in non-contiguous memory spaces, reducing memory wastage. Additionally, it allows for efficient memory sharing during parallel sampling, resulting in a significant increase in throughput.
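To make the idea concrete, here is a small, purely illustrative sketch of a paged KV cache. This is not vLLM's code; the block size, class, and method names are invented for clarity. The point is that each sequence keeps a block table mapping its cached tokens to fixed-size physical blocks, which need not be contiguous and can be shared between sequences (for example, when parallel samples share the same prompt).

```python
# Illustrative sketch of the paged KV-cache idea behind PagedAttention.
# NOT vLLM's implementation: block size, names, and structure are hypothetical.

BLOCK_SIZE = 16  # tokens per physical block (hypothetical value)


class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))  # pool of physical block ids
        self.ref_count = [0] * num_physical_blocks           # sequences using each block
        self.block_table = {}   # seq_id -> list of physical block ids (may be non-contiguous)
        self.seq_len = {}       # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one new token of `seq_id`."""
        table = self.block_table.setdefault(seq_id, [])
        length = self.seq_len.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:          # current block is full: grab a fresh one
            block = self.free_blocks.pop()
            self.ref_count[block] += 1
            table.append(block)
        self.seq_len[seq_id] = length + 1

    def fork(self, parent_id, child_id):
        """Share the parent's blocks with a child sequence (parallel sampling)."""
        self.block_table[child_id] = list(self.block_table[parent_id])
        self.seq_len[child_id] = self.seq_len[parent_id]
        for block in self.block_table[child_id]:
            self.ref_count[block] += 1        # copy-on-write would kick in once outputs diverge


cache = PagedKVCache(num_physical_blocks=64)
for _ in range(40):                   # cache a 40-token prompt for sequence 0
    cache.append_token(seq_id=0)
cache.fork(parent_id=0, child_id=1)   # a second sample reuses the prompt's blocks
print(cache.block_table)              # physical blocks are shared and non-contiguous
```

Because unused slots exist only inside the last, partially filled block of each sequence, wasted memory stays small, and shared blocks mean the prompt's cache is stored once no matter how many samples are drawn from it.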

In summary, vLLM manages attention key and value memory efficiently through the PagedAttention mechanism. It delivers strong throughput and integrates with popular Hugging Face models, and it can be used for both offline batched inference and online serving.
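For reference, a minimal offline-inference example in the style of the vLLM quickstart is shown below; the model name and sampling settings are placeholders, and the exact API may differ between versions, so check the project's documentation.

```python
# Minimal offline-inference sketch in the style of the vLLM quickstart.
# Model name and sampling settings are placeholders; verify against the
# documentation of the vLLM version you install.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "Large language models are"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")              # loads the model and sets up the paged KV cache
outputs = llm.generate(prompts, sampling_params)  # batched generation over all prompts

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, vLLM also ships a server entrypoint that exposes an HTTP API; see the project's documentation for the exact command and options.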

You can check out the blog article and the GitHub repository for more information.

