Recent advancements in language models (LMs) have primarily focused on improving performance on static benchmarks, which neglects the dynamic nature of language and knowledge. To address this limitation and improve the evaluation of question-answering models, we have introduced benchmarks and methods that take temporal dynamics into account.
In 2021, we published the paper “Mind the Gap: Assessing Temporal Generalization in Neural Language Models” and introduced dynamic language modelling benchmarks for WMT and arXiv. These benchmarks take into account the evolving nature of language and knowledge, providing a more realistic evaluation of model performance. We found that current state-of-the-art large LMs struggle with temporal generalization, and that their performance degrades most on knowledge-intensive tokens.
Building on this research, we are now releasing two papers and a new benchmark to further advance the study of temporal generalization in question-answering models. In “StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models,” we investigate how parametric and retrieval-augmented question-answering models adapt to new information. Our new benchmark, StreamingQA, consists of human-written and automatically generated questions based on time-stamped news articles spanning 14 years. Because every article is time-stamped, we can measure both how well a model answers questions about newly published information and whether it still answers questions about older articles, that is, whether it avoids catastrophic forgetting. We found that parametric models can be updated without full retraining while avoiding catastrophic forgetting. For retrieval-augmented models, adding new articles to the search space allows rapid adaptation, but models with an outdated underlying LM underperform those with a retrained LM.
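To make the idea of evaluating over time concrete, here is a minimal sketch of scoring a question-answering model separately for each year of a time-stamped question set. It is an illustration only, not the StreamingQA evaluation code: the `DatedQuestion` fields, the `accuracy_by_year` helper, and the use of case-insensitive exact match as the metric are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date
from typing import Callable, Dict, Iterable, NamedTuple


class DatedQuestion(NamedTuple):
    """Hypothetical record: a question, its reference answer, and a timestamp."""
    question: str
    answer: str
    asked_on: date


def accuracy_by_year(
    questions: Iterable[DatedQuestion],
    predict: Callable[[str], str],
) -> Dict[int, float]:
    """Return exact-match accuracy for each year, in chronological order.

    `predict` stands in for any QA model: it maps a question to an answer string.
    Exact match (case-insensitive, whitespace-trimmed) is a simplifying assumption.
    """
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for item in questions:
        year = item.asked_on.year
        total[year] += 1
        if predict(item.question).strip().lower() == item.answer.strip().lower():
            correct[year] += 1
    return {year: correct[year] / total[year] for year in sorted(total)}
```

Running this once for a model before an update and once after gives two per-year curves. Under these assumptions, higher accuracy on recent years after the update suggests adaptation, while lower accuracy on earlier years suggests forgetting.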
Additionally, in “Internet-augmented language models through few-shot prompting for open-domain question answering,” we explore combining few-shot prompting with Google Search as a retrieval component to improve the factuality of language models and give them access to up-to-date information. This approach requires no fine-tuning and no additional learned parameters, making it applicable to any language model. In our experiments, LMs conditioned on information from the web outperformed closed-book models of similar or larger size on open-domain question-answering tasks.
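The general pattern of conditioning a model on retrieved text through a few-shot prompt can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the `Evidence` / `Question` / `Answer` prompt layout, the `build_prompt` and `answer_with_retrieval` helpers, and the `retrieve` and `generate` callables are assumptions for this example. No real search or model API is called; both are passed in as plain functions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical few-shot example: (evidence, question, answer).
Shot = Tuple[str, str, str]


def build_prompt(shots: List[Shot], evidence: str, question: str) -> str:
    """Format solved examples, then the new evidence and question with an open answer slot."""
    blocks = [f"Evidence: {e}\nQuestion: {q}\nAnswer: {a}" for e, q, a in shots]
    blocks.append(f"Evidence: {evidence}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str], List[str]],           # stand-in for a search backend
    generate: Callable[[str], Tuple[str, float]],   # stand-in for an LM: prompt -> (answer, score)
    shots: List[Shot],
    top_k: int = 3,
) -> Optional[str]:
    """Prompt the model once per retrieved passage and keep the highest-scoring answer."""
    candidates: List[Tuple[float, str]] = []
    for passage in retrieve(question)[:top_k]:
        answer, score = generate(build_prompt(shots, passage, question))
        candidates.append((score, answer))
    if not candidates:
        return None  # nothing was retrieved, so there is nothing to condition on
    return max(candidates, key=lambda pair: pair[0])[1]
```

Because the model's weights are never touched, the same wrapper could be placed around any LM that exposes a text-generation function. Keeping the single highest-scoring candidate is just one simple way to combine answers across passages.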
Together, these results contribute to a better understanding of how language models can adapt to evolving knowledge and continue to answer a wide range of questions accurately. By accounting for temporal dynamics and incorporating external knowledge sources, we can improve both the performance and the flexibility of question-answering models.