Unleashing Infini-gram: Transforming N-Gram Language Models for Modern Text Analysis

Jimmy W.

February 13, 2024

The Significance and Features of N-Gram Language Models Modernized for the Era of Neural LLMs

In a recent paper, researchers from the University of Washington and the Allen Institute for Artificial Intelligence examine how n-gram language models remain relevant in the era of neural LLMs and introduce techniques for modernizing these traditional models.

Modernization of N-Gram LMs

The authors modernize traditional n-gram LMs by scaling the training data to 1.4 trillion tokens, which makes theirs the largest n-gram LM built to date. They also introduce the ∞-gram LM, in which n is unbounded: rather than fixing a context length, the model uses a backoff variant that conditions on the longest suffix of the context found in the training data, which improves accuracy.
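To make the backoff idea concrete, here is a minimal sketch in Python. It assumes a tiny in-memory corpus of word tokens and counts by brute-force scanning; the function name `infgram_next_token_probs` is hypothetical and is not taken from the paper or its engine. It also only counts occurrences that are followed by a token, which is a simplification.

```python
from collections import Counter

def infgram_next_token_probs(corpus, context):
    """Toy backoff: find the longest suffix of `context` that occurs in
    `corpus` followed by at least one token, then return the empirical
    next-token distribution after that suffix.

    Returns (effective_n_minus_1, {token: probability}).
    """
    for start in range(len(context) + 1):
        suffix = context[start:]          # progressively shorter suffixes
        k = len(suffix)
        # Count every token that follows an occurrence of `suffix`.
        followers = Counter(
            corpus[i + k]
            for i in range(len(corpus) - k)
            if corpus[i:i + k] == suffix
        )
        if followers:
            total = sum(followers.values())
            return k, {tok: c / total for tok, c in followers.items()}
    return 0, {}                          # only reached for an empty corpus

corpus = "the cat sat on the mat the cat ran".split()
context = "a dog saw the cat".split()
matched_len, probs = infgram_next_token_probs(corpus, context)
```

In this toy run the longest matching suffix is "the cat" (length 2), which is followed once by "sat" and once by "ran", so each receives probability 0.5. A linear scan like this would not work at trillion-token scale.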

Efficiency and Optimization

The ∞-gram LM is powered by a suffix array, which replaces the n-gram count tables that would be impractical at this scale and requires 7 bytes of storage per token. The paper also describes efficient methods for n-gram counting, occurrence position retrieval, and document identification, along with optimizations such as reusing search results and searching on disk to speed up computation.
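The following Python sketch illustrates how a suffix array supports n-gram counting and occurrence position retrieval through binary search. It is a toy under stated assumptions: tokens are plain strings held in memory, the array is built with a naive sort, and the helper names `build_suffix_array` and `find_range` are hypothetical. It is not the paper's engine, which works on a far larger corpus.

```python
def build_suffix_array(tokens):
    """Positions of all suffixes, sorted lexicographically.
    Naive construction, only suitable for a toy corpus."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_range(tokens, sa, query):
    """Return (lo, hi) such that sa[lo:hi] lists every position
    where `query` occurs. Uses two binary searches over the array."""
    k = len(query)

    def prefix(rank):
        # First k tokens of the suffix at this rank in the sorted order.
        return tokens[sa[rank]:sa[rank] + k]

    # Lower bound: first suffix whose k-token prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose k-token prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return first, lo

tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)

lo, hi = find_range(tokens, sa, ["the", "cat"])
count = hi - lo                   # number of occurrences of the n-gram
positions = sorted(sa[lo:hi])     # where those occurrences start
```

Because all occurrences of an n-gram sit in one contiguous slice of the suffix array, the count is just the width of that slice, and the slice itself gives the occurrence positions. No table keyed by n-gram is needed, and the query length is not fixed in advance.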

Applications and Outlook

Applied to a range of neural LMs, Infini-gram consistently improves perplexity, which shows that it complements neural LMs across different model series. Finally, the paper presents preliminary applications of the Infini-gram engine, ranging from understanding text corpora to mitigating copyright infringement.
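One common way to let a count-based model complement a neural one is to linearly interpolate their per-token probabilities. The Python sketch below shows that combination and how perplexity is computed from it. The interpolation weight `lam` and all probability values are made-up illustrative inputs, not results from the paper, and the article itself does not spell out the exact combination scheme.

```python
import math

def interpolate(p_neural, p_infgram, lam):
    """Linear interpolation of two probabilities for the same token.
    `lam` is the weight on the ∞-gram estimate (hypothetical value)."""
    return lam * p_infgram + (1.0 - lam) * p_neural

def perplexity(token_probs):
    """exp of the average negative log-probability over the sequence."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up probabilities that each model assigns to the true next token
# at three positions; these are illustrative inputs only.
neural = [0.2, 0.1, 0.5]
infgram = [0.9, 0.0, 0.5]   # confident where it has seen the context

mixed = [interpolate(pn, pi, lam=0.5) for pn, pi in zip(neural, infgram)]

ppl_neural = perplexity(neural)
ppl_mixed = perplexity(mixed)
```

With these toy inputs the mixed model has lower perplexity than the neural model alone. The interpolation also keeps the probability above zero at the position where the count-based estimate is 0.0, which would otherwise make the log-probability undefined.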
