Large language models (LLMs) have demonstrated impressive capabilities across many fields, and new research papers and model updates appear almost daily. These capabilities come at a steep price, however: the models have enormous parameter counts, and training them involves trillions of tokens, making the process extremely expensive. Running inference at scale is costly as well.
Researchers from Stanford University and Cornell University recently published a paper proposing a way to tame this cost. They focus on the challenge of processing large document collections with LLMs, noting that running inference over the roughly 55 million Wikipedia pages would cost more than $100,000 even at a rate of $0.002 per 1,000 tokens. Their approach aims to reduce inference cost by a factor of 110 while also improving result quality compared with running the LLM directly on each document.
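As a rough back-of-the-envelope check of that figure, consider the sketch below. The tokens-per-page value is an assumption made for illustration; the article only gives the page count, the price, and the total cost.

```python
# Back-of-the-envelope estimate of the inference cost cited above.
# tokens_per_page is an assumed average, not a figure from the paper.
num_pages = 55_000_000          # Wikipedia pages to process
tokens_per_page = 1_000         # assumed average document length
price_per_1k_tokens = 0.002     # dollars per 1,000 tokens

total_tokens = num_pages * tokens_per_page
cost = total_tokens / 1_000 * price_per_1k_tokens
print(f"~{total_tokens:,} tokens -> ${cost:,.0f}")
# ~55,000,000,000 tokens -> $110,000
```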
The proposed system, called EVAPORATE, uses LLMs in two different ways. The first strategy prompts the LLM to extract values directly from documents; the second prompts it to synthesize code that performs the extraction. Evaluating the two, the researchers found that code synthesis is far cheaper but less accurate than processing each document directly with the LLM.
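A minimal sketch of the two strategies follows, assuming a generic `complete(prompt)` wrapper around an LLM API; the function names and prompt wording are illustrative, not the paper's exact ones.

```python
def complete(prompt: str) -> str:
    # Stand-in for any LLM completion API; wire up a real provider here.
    raise NotImplementedError

def direct_extract(document: str, attribute: str) -> str:
    # Strategy 1: the LLM reads every document and extracts the value itself.
    # Cost scales with the total number of tokens across all documents.
    prompt = f"Extract the value of '{attribute}' from this document:\n{document}"
    return complete(prompt)

def synthesize_extractor(sample_document: str, attribute: str) -> str:
    # Strategy 2: the LLM sees only a few sample documents and writes a
    # Python function; that function then runs cheaply over the whole corpus.
    prompt = (
        f"Write a Python function extract(text) that returns the value of "
        f"'{attribute}' from documents formatted like this one:\n{sample_document}"
    )
    return complete(prompt)  # returns source code, not an extracted value
```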
EVAPORATE addresses the expense of LLM inference by identifying redundancies across the many documents in a collection and exploiting them. In one example, the team illustrates how EVAPORATE extracts the device classification attribute from FDA reports for medical devices: rather than processing every semi-structured document individually with the LLM, it uses the LLM to generate reusable functions that perform the extraction.
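To make the idea concrete, here is the kind of reusable function the LLM might synthesize for the FDA example. The field label and regex are invented for illustration; a real synthesized function would be tailored to the actual formatting of the reports.

```python
import re

# Hypothetical example of a machine-synthesized extraction function.
def extract_device_classification(text: str) -> str | None:
    match = re.search(r"Device Classification:\s*(.+)", text)
    return match.group(1).strip() if match else None

# Once generated, the function runs over the whole corpus without
# any further LLM calls:
# classifications = [extract_device_classification(doc) for doc in documents]
```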
To improve quality while keeping costs low, the team proposes an extended code-synthesis implementation called EVAPORATE-CODE+. This approach generates multiple candidate functions and combines their extractions using weak supervision. Weak supervision is typically applied to human-written functions; EVAPORATE-CODE+ instead operates over machine-generated functions and addresses the challenges of that setting to achieve its quality improvements.
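Below is a simplified sketch of the aggregation step, using plain majority voting as a stand-in for the paper's weak-supervision estimator, which additionally learns per-function accuracies.

```python
from collections import Counter

# Simplified aggregation over candidate extractors: majority vote in
# place of the weak-supervision estimator used by EVAPORATE-CODE+.
def aggregate(candidate_functions, document):
    votes = []
    for fn in candidate_functions:
        try:
            value = fn(document)   # machine-generated code may crash
        except Exception:
            continue               # treat failures as abstentions
        if value is not None:
            votes.append(value)
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
```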
The researchers evaluated EVAPORATE on 16 document collections spanning different formats, topics, and attribute types. Across these 16 evaluation settings of 10,000 documents each, EVAPORATE-CODE+ outperformed state-of-the-art systems while reducing the number of tokens the LLM must process by a factor of 110, on average.
In conclusion, this research paper presents a promising approach to automating table extraction from semi-structured documents using LLMs. By analyzing the tradeoff between direct extraction and code synthesis, and by introducing an extended implementation that improves quality while keeping costs low, this work is a notable step forward for the data management community.
For more details, check out the paper ([link to the paper]) and the repository ([link to the repository]).