Unleashing Infini-gram: Transforming N-Gram Language Models for Modern Text Analysis

The Significance and Features of N-Gram Language Models Modernized for the Era of Neural LLMs

In a recent paper, researchers from the University of Washington and the Allen Institute for Artificial Intelligence examine the continued relevance of n-gram language models in the era of neural LLMs and introduce techniques for modernizing these traditional models.

Modernization of N-Gram LMs

The authors modernize traditional n-gram LMs by scaling the training data to an unprecedented 1.4 trillion tokens, making it the largest n-gram LM built to date. They also introduce the concept of an ∞-gram LM, which places no upper bound on n and uses a backoff variant for improved accuracy.
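The core idea of the ∞-gram backoff can be illustrated with a toy sketch: use the longest suffix of the context that actually appears in the corpus, and estimate the next token from that suffix's continuation counts. This is a minimal illustration, not the paper's implementation; the corpus here is a hypothetical in-memory token list, whereas the real engine works over trillions of tokens via a suffix array.

```python
from collections import Counter

def infinigram_estimate(corpus, context):
    """Toy ∞-gram estimate: back off from the full context to shorter
    suffixes until one has at least one continuation in the corpus,
    then return the empirical next-token distribution for that suffix."""
    for start in range(len(context) + 1):
        suffix = list(context[start:])
        # collect the tokens that immediately follow `suffix` in the corpus
        counts = Counter(
            corpus[i + len(suffix)]
            for i in range(len(corpus) - len(suffix))
            if corpus[i:i + len(suffix)] == suffix
        )
        if counts:  # longest matching suffix found; stop backing off
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}  # empty corpus edge case
```

Because n is unbounded, the effective context length adapts per query: rare contexts back off quickly, while long verbatim matches in the corpus are exploited in full.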

Efficiency and Optimization

The ∞-gram LM is powered by a suffix array, which replaces impractical n-gram count tables and requires only 7 bytes of storage per token. The paper also describes efficient methods for n-gram counting, occurrence-position retrieval, and document identification, along with optimizations such as reusing earlier search results and performing searches on disk to speed up computation.
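A suffix array makes counting fast because all occurrences of an n-gram sit in one contiguous range of the sorted suffixes, so a count reduces to two binary searches. The sketch below is a simplified in-memory version (the toy construction is quadratic; production suffix arrays are built in near-linear time and stored on disk):

```python
def build_suffix_array(tokens):
    # Sort suffix start positions lexicographically by the suffix they
    # begin (toy O(n^2 log n) construction for illustration only).
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, ngram):
    """Count occurrences of `ngram` with two binary searches over the
    suffix array: occurrences form one contiguous block of suffixes."""
    ngram = list(ngram)
    n = len(ngram)
    # lower bound: first suffix whose length-n prefix is >= ngram
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < ngram:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # upper bound: first suffix whose length-n prefix is > ngram
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return hi - first
```

The same range also yields occurrence positions directly (the suffix-array entries between the two bounds), which is how position retrieval and document identification fall out of the same index.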

Applications and Outlook

Applying Infini-gram across diverse neural LMs has demonstrated consistent perplexity improvements, underscoring its efficacy as a complement to neural models across different model series. Lastly, the paper presents preliminary applications of the Infini-gram engine, opening up diverse possibilities, from understanding text corpora to mitigating copyright infringement.
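The perplexity gains come from combining the two models' predictions. A common way to do this, shown here as an assumed sketch (the mixing weight `lam` and the per-token dictionaries are illustrative, not the paper's exact interpolation scheme), is to linearly interpolate the two next-token distributions:

```python
def interpolate(p_neural, p_infinigram, lam=0.5):
    """Linearly mix a neural LM's next-token distribution with an
    ∞-gram distribution. `lam` (hypothetical here) controls how much
    weight the n-gram side receives; both inputs map token -> prob."""
    vocab = set(p_neural) | set(p_infinigram)
    return {
        tok: (1 - lam) * p_neural.get(tok, 0.0)
             + lam * p_infinigram.get(tok, 0.0)
        for tok in vocab
    }
```

When the ∞-gram side finds a long verbatim match in the corpus, its sharp distribution can correct the neural model's prediction; when it finds no match, its contribution is diffuse and the neural model dominates.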
