Researchers Highlight Bias in Pretraining Data of LLMs
Advancements in Natural Language Processing and Natural Language Generation have made Large Language Models (LLMs) a popular choice for many real-world applications. With their ability to imitate human language, these models have spread across a wide range of domains.
However, researchers have noticed that these models exhibit bias due to the composition of their pretraining data. To address this issue, a team of researchers introduced a new dataset and framework called AboutMe, aiming to document the effects of data filtering on text rooted in social and geographic contexts.
New Dataset and Framework – AboutMe
AboutMe is a dataset developed by a collaborative team from the Allen Institute for AI, the University of California, Berkeley, Emory University, Carnegie Mellon University, and the University of Washington. It aims to document biases in pretraining data and to highlight the limitations of current data curation workflows.
The dataset captures the social and geographic context of web-scraped text drawn from ‘about me’ sections of websites, shedding light on whose language is represented in web-scraped corpora.
Implications for Language Models
Using the dataset, the team conducted sociolinguistic analyses that revealed the unintended implications of data filtering on the portrayal of varied viewpoints in language models. The findings emphasize the need for more research on pretraining data curation procedures and their social implications.
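To make the filtering step concrete, here is a minimal, hypothetical sketch of the kind of heuristic quality filter commonly applied to web-scraped pretraining corpora. The function name, thresholds, and example pages are illustrative assumptions, not the actual filters analyzed in the AboutMe study; the point is simply that blunt rules such as a minimum length or a cap on non-Latin characters can disproportionately discard text from particular communities.

```python
import re

def passes_quality_filter(text: str,
                          min_words: int = 20,
                          max_nonlatin_ratio: float = 0.2) -> bool:
    """Toy heuristic filter (illustrative only): keep a page if it is
    long enough and mostly composed of Latin-script letters."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short: discarded regardless of content
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    # Count alphabetic characters outside A-Z / a-z.
    nonlatin = sum(1 for c in letters if not re.match(r"[A-Za-z]", c))
    return nonlatin / len(letters) <= max_nonlatin_ratio

# Hypothetical 'about me' pages illustrating who gets filtered out.
pages = [
    "Short bio.",                                          # fails length check
    " ".join(["We are a community arts collective."] * 5), # kept
    "私たちは 東京の 小さな 書店です。 " * 10,              # fails script check
]
kept = [p for p in pages if passes_quality_filter(p)]
```

Under these toy thresholds, only the English page survives: the Japanese bookstore's self-description is dropped purely because of its script, which is exactly the kind of unintended exclusion the sociolinguistic analyses above are probing.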
By sharing this research, the team aims to raise awareness of the intricate details involved in data filtering during LLM development and advocate for a more comprehensive understanding of the social factors in natural language processing.