The Significance and Features of N-Gram Language Models Modernized for the Era of Neural LLMs
In a recent paper, researchers from the University of Washington and the Allen Institute for Artificial Intelligence explore the continued relevance of n-gram language models in the era of neural LLMs and introduce techniques for modernizing these traditional models.
Modernization of N-Gram LMs
Traditional n-gram LMs have been modernized by scaling training data to an unprecedented 1.4 trillion tokens, yielding the largest n-gram LM to date. The authors also introduce the ∞-gram LM, which removes the upper bound on n and relies on a backoff variant: for each prediction, it backs off to the longest suffix of the context that actually occurs in the corpus, which improves accuracy.
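To make the backoff idea concrete, here is a minimal sketch of ∞-gram estimation over a toy corpus. The counting is done naively for clarity; the function names and the list-based corpus are illustrative, not the paper's implementation, which answers these counts with the suffix-array engine described in the next section.

```python
from typing import List, Tuple

def count_occurrences(corpus: List[str], pattern: List[str]) -> int:
    """Naive count of a token sequence's occurrences in the corpus."""
    m = len(pattern)
    return sum(corpus[i:i + m] == pattern for i in range(len(corpus) - m + 1))

def infgram_prob(corpus: List[str], context: List[str], token: str) -> Tuple[float, int]:
    """Estimate P(token | context) with unbounded n: back off to the
    longest suffix of the context that occurs in the corpus."""
    for start in range(len(context) + 1):
        suffix = context[start:]
        # The empty suffix always "occurs", so the loop always returns.
        denom = count_occurrences(corpus, suffix) if suffix else len(corpus)
        if denom > 0:
            numer = count_occurrences(corpus, suffix + [token])
            return numer / denom, len(suffix) + 1  # estimate and effective n
    return 0.0, 1
```

For example, with corpus = "a b c a b d".split(), the call infgram_prob(corpus, "a b".split(), "c") finds that the full context "a b" occurs twice, is followed by "c" once, and returns (0.5, 3), with no backoff needed.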
Efficiency and Optimization
The ∞-gram LM is powered by a suffix array, which replaces impractical n-gram count tables and achieves remarkable efficiency at 7 bytes of storage per token. The paper also describes efficient methods for n-gram counting, occurrence-position retrieval, and document identification, along with optimizations such as reusing search results and on-disk search to speed up computation.
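A minimal sketch of how a suffix array supports these queries: once all suffix start positions are sorted, every n-gram's occurrences form one contiguous range in the array, found by binary search; the range's width is the count and its entries are the positions. The construction below is a naive in-memory version for illustration (the paper's engine operates over a tokenized, on-disk array), the names are illustrative, and the `key=` form of `bisect` requires Python 3.10+.

```python
import bisect
from typing import List, Tuple

def build_suffix_array(tokens: List[int]) -> List[int]:
    """All suffix start positions, sorted lexicographically.
    O(n^2 log n) here for clarity; production engines build this in
    (near-)linear time over the on-disk token array."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_range(tokens: List[int], sa: List[int], query: List[int]) -> Tuple[int, int]:
    """Binary-search the suffix array for suffixes prefixed by `query`.
    Truncating each suffix to len(query) preserves sorted order, so
    bisect is valid."""
    m = len(query)
    lo = bisect.bisect_left(sa, query, key=lambda i: tokens[i:i + m])
    hi = bisect.bisect_right(sa, query, key=lambda i: tokens[i:i + m])
    return lo, hi

def ngram_count(tokens: List[int], sa: List[int], query: List[int]) -> int:
    lo, hi = ngram_range(tokens, sa, query)
    return hi - lo  # count; sa[lo:hi] lists the occurrence positions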
Applications and Outlook
Combining Infini-gram with diverse neural LMs yields consistent perplexity improvements across different model series, underscoring its efficacy as a complement to neural models. Lastly, the paper presents preliminary applications of the Infini-gram engine, opening up diverse possibilities, from understanding text corpora to mitigating copyright infringement.
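At its core, one such combination is a simple per-token mixture of next-token probabilities. The sketch below assumes a fixed mixing weight lam as a hypothetical placeholder; in practice the weight is tuned on held-out data, and the paper's scheme may adjust it based on how reliable the ∞-gram estimate is.

```python
def mix_next_token_prob(p_neural: float, p_infgram: float, lam: float = 0.5) -> float:
    """Linear interpolation of next-token probabilities from a neural LM
    and an infini-gram estimate. lam = 0.5 is a placeholder value, not
    a recommendation from the paper."""
    return lam * p_infgram + (1.0 - lam) * p_neural
```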