
RedPajama: Advancing AI with Fully Open-Source Language Models for Research and Applications


RedPajama: Advancing Open-Source AI Models for Customization and Research

The field of AI has seen tremendous advances with the introduction of foundation models. However, many of these models are only partially open and can be accessed only through commercial APIs, which restricts their use and limits opportunities for research and customization. RedPajama, a collaborative project involving Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, the MILA Québec AI Institute, and Together, aims to change that by creating leading, fully open-source models.

The first step in the RedPajama project was to reproduce the LLaMA training dataset, and this has already been completed. The recent wave of open-source models has sparked a movement similar to the one Linux created for operating systems. Stable Diffusion demonstrated that open-source models can compete with commercial offerings and foster creativity through community participation. A similar movement is now under way for large language models, with semi-open models like LLaMA, Alpaca, Vicuna, and Koala, as well as fully open models like Pythia, OpenChatKit, Open Assistant, and Dolly.

RedPajama’s goal is to develop a reproducible, fully open, leading language model with three key components: pre-training data, base models, and instruction-tuning data and models. The project has already released the first of these: a fully open, 1.2-trillion-token pre-training dataset built by following the recipe described in the LLaMA paper. LLaMA, the leading open base model suite, is RedPajama’s starting point: it was trained on a large dataset that underwent careful quality filtering, yet LLaMA and its derivatives remain restricted to non-commercial research use. RedPajama aims to change that by creating a fully open-source reproduction of LLaMA, making it available for commercial applications and providing a more transparent pipeline for research.

The RedPajama dataset, which can be downloaded from Hugging Face, contains 1.2 trillion tokens; a smaller random sample is also available. The data is divided into seven slices, each of which went through meticulous pre-processing and filtering to ensure high quality. For example, the CommonCrawl slices were processed and filtered to select pages similar to Wikipedia; the GitHub data was filtered by license and quality; and the arXiv slice consists of scientific articles with boilerplate removed. Similar filtering was applied to the Books, Wikipedia, and StackExchange subsets.
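To make the "select pages similar to Wikipedia" step concrete, here is a minimal, hypothetical sketch of that kind of quality filter. The real pipeline uses a trained classifier over CommonCrawl pages; the `score_page` heuristic below is purely a stand-in for illustration, and the function names and threshold are assumptions, not the project's actual code.

```python
# Hypothetical sketch of a Wikipedia-similarity quality filter for web pages.
# In practice a trained classifier (e.g. fastText-style) produces the score;
# this toy heuristic just rewards longer, reference-style vocabulary.

def score_page(text: str) -> float:
    """Toy stand-in for a Wikipedia-likeness score in [0, 1]."""
    words = text.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    return min(1.0, avg_word_len / 10)

def filter_pages(pages: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only pages whose quality score clears the threshold."""
    return [p for p in pages if score_page(p) >= threshold]

pages = [
    "Photosynthesis is the biochemical process by which plants convert light.",
    "buy now click here",
]
kept = filter_pages(pages)  # only the encyclopedic page survives
```

The same keep-or-drop structure applies regardless of how the score is computed; swapping in a real classifier only changes `score_page`.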

The RedPajama project is collaborating with the Meerkat project to release a Meerkat dashboard and embeddings for interactive analysis of the GitHub subset of the corpus. The next step in the project is to train a robust base model using the reproduced pre-training data. The Oak Ridge Leadership Computing Facility is supporting the project through the INCITE program, and a full suite of models will be made available soon. The team is also excited to tune the models based on instructions, inspired by the success of Alpaca. They have already received hundreds of thousands of natural user instructions via OpenChatKit, which will be used to release instruction-tuned versions of the RedPajama models.
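As a rough illustration of what instruction tuning involves, the sketch below shows how collected user instructions (such as those gathered via OpenChatKit) might be rendered into prompt/response training strings. The field names and prompt template here are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch: turning instruction/response pairs into training text.
# The "### Instruction / ### Response" template is an assumed convention
# (popularized by Alpaca-style tuning), not RedPajama's confirmed format.

def format_example(instruction: str, response: str) -> str:
    """Render one instruction/response pair as a single training string."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

examples = [
    {"instruction": "Summarize what RedPajama is.",
     "response": "RedPajama is a project building fully open language models."},
]
training_texts = [format_example(e["instruction"], e["response"])
                  for e in examples]
```

The base model is then fine-tuned on such strings so it learns to continue an instruction with an appropriate response.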


Don’t forget to check out the RedPajama base dataset and the RedPajama GitHub repository.

About the author: Niharika is a highly enthusiastic individual currently pursuing her B.Tech in Machine Learning and AI at the Indian Institute of Technology (IIT), Kharagpur. She is passionate about staying updated with the latest developments in these fields and enjoys reading about them.

