Legal concerns have been raised about large language models (LMs) because they are often trained on copyrighted content. This creates a tradeoff between legal risk and model performance: training LMs only on permissively licensed or public-domain data hurts accuracy, because such datasets are limited and consist mainly of copyright-expired books, government records, and permissively licensed code.
A recent study by the University of Washington, UC Berkeley, and the Allen Institute for AI proposes a way around this tradeoff: splitting the training data into parametric and nonparametric subsets to improve the risk-performance balance. The team trains the LM parameters on low-risk data only and moves high-risk data into a nonparametric component (a datastore) that is consulted exclusively at inference time, where its contents are retrieved to enhance model predictions. Because the datastore sits outside the model weights, developers can remove their data from it at any time, down to individual examples, and the datastore can be updated easily. The approach also attributes model predictions to data contributors at the sentence level. Together, these capabilities make it easier to comply with a range of data-use restrictions. In contrast, purely parametric models make it difficult to remove high-risk data after training and struggle to attribute predictions to data at scale.
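To make the per-example removal and sentence-level attribution concrete, here is a minimal sketch of such a datastore. This is an illustrative toy, not the paper's implementation; the class and method names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One retrievable datastore entry, tagged with its contributor."""
    text: str
    contributor: str


class Datastore:
    """Toy inference-time datastore: entries can be attributed and removed
    per example or per contributor without retraining the parametric LM."""

    def __init__(self):
        self.entries = {}
        self._next_id = 0

    def add(self, text: str, contributor: str) -> int:
        """Insert an entry and return its id."""
        eid = self._next_id
        self.entries[eid] = Entry(text, contributor)
        self._next_id += 1
        return eid

    def remove_example(self, eid: int) -> None:
        """Opt a single example out of future retrieval."""
        self.entries.pop(eid, None)

    def remove_contributor(self, contributor: str) -> None:
        """Opt out every example from one data contributor."""
        self.entries = {i: e for i, e in self.entries.items()
                        if e.contributor != contributor}

    def attribute(self, eid: int) -> str:
        """Map a retrieved entry back to the contributor who supplied it."""
        return self.entries[eid].contributor
```

The key design point is that opt-out is a dictionary deletion rather than a retraining run, which is exactly what a parametric-only model cannot offer.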
To implement this idea, the researchers developed SILO, a novel nonparametric language model. For SILO's parametric component they built a new pretraining corpus, OPEN LICENSE CORPUS (OLC), which spans many domains but is heavily skewed toward code and government text, making generalization beyond these narrow domains a central challenge. The team trained three 1.3B-parameter LMs on different subsets of OLC and built a test-time datastore that can incorporate high-risk data and retrieve its contents at inference. They compared two retrieval methods: a retrieval-in-context approach (RIC-LM), which retrieves text blocks and feeds them to the parametric LM in context, and a nearest-neighbors approach (kNN-LM), which uses a nonparametric next-token prediction function.
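The kNN-LM idea can be sketched in a few lines: the nonparametric next-token distribution is built from the nearest datastore neighbors of the current context and interpolated with the parametric LM's distribution. This is a simplified illustration with invented function names and a brute-force neighbor search, not SILO's actual implementation.

```python
import numpy as np


def knn_lm_next_token(p_lm, keys, values, query, vocab_size,
                      k=4, temp=1.0, lam=0.5):
    """Interpolate a parametric next-token distribution with a kNN one.

    p_lm:   parametric LM distribution over the vocabulary, shape [vocab_size]
    keys:   datastore context vectors, shape [n, d]
    values: next-token id stored with each key, shape [n]
    query:  context vector for the current prefix, shape [d]
    lam:    interpolation weight on the nonparametric distribution
    """
    # Distance from the query to every datastore key
    # (toy exhaustive search; real systems use an approximate index).
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]

    # Softmax over negative distances of the k nearest neighbors.
    w = np.exp(-dists[nn] / temp)
    w /= w.sum()

    # Aggregate neighbor weights onto the tokens they stored.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], w)

    # Final mixture: lam * kNN distribution + (1 - lam) * parametric LM.
    return lam * p_knn + (1 - lam) * p_lm
```

Because the kNN term looks up real continuations from the datastore, adding or removing datastore entries changes predictions immediately, with no retraining of `p_lm`.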
The researchers evaluated SILO against Pythia, a parametric LM trained largely on high-risk data. They measured language-modeling perplexity across 14 domains, covering both data inside and outside OLC. Parametric-only SILO performs well in domains covered by OLC but struggles out of domain. Supplementing SILO with an inference-time datastore, however, significantly improves out-of-domain performance. Both kNN-LM and RIC-LM contribute to this improvement, but kNN-LM performs better overall, closing an average of 90% of the gap with the Pythia baseline across domains. The analysis revealed that kNN-LM's nonparametric next-token prediction is robust to domain shift and benefits from a growing datastore.
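Perplexity, the metric used above, is just the exponential of the average negative log-likelihood the model assigns to each token. A minimal sketch of the computation, with an illustrative function name:

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log).

    Lower is better: a model that assigns probability 0.5 to every
    token has perplexity 2, i.e. it is as uncertain as a coin flip.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

Comparing this number for SILO (with and without the datastore) against Pythia on each domain is how the gap figures above are obtained.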
In conclusion, the study suggests that expanding the nonparametric datastore and improving the nonparametric method could close the remaining performance gaps in the few domains where SILO has not yet matched Pythia.