Introducing POLYLM: A Multilingual Large Language Model
The Artificial Intelligence (AI) sector is buzzing with excitement over the recent advancements in Large Language Models (LLMs). These models, trained on massive amounts of data, have the ability to understand, reason, and generate text based on natural language instructions. One of the limitations of current LLMs is their focus on English and resource-rich languages. To address this, researchers from DAMO Academy and Alibaba Group have developed a multilingual LLM called POLYLM.
POLYLM has been released in two sizes, POLYLM-13B and POLYLM-1.7B, to make the model easy to adopt. Both were trained on a massive dataset of 640B tokens drawn from publicly accessible sources such as Wikipedia, mC4, and CC-100. To overcome the scarcity of data for low-resource languages, the researchers employed a curriculum learning technique: training initially focuses mostly on English, then gradually increases the ratio of high-quality, low-resource-language data as training progresses.
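The idea behind such a curriculum can be sketched as a sampling schedule that interpolates per-language data ratios over training steps. This is a minimal illustration only; the start/end proportions and the linear schedule below are assumptions for clarity, not the paper's actual configuration.

```python
def curriculum_weights(start, end, step, total_steps):
    """Linearly interpolate per-language sampling ratios over training.

    start/end: dicts mapping language code -> sampling proportion at
    the beginning and end of training (illustrative values only).
    """
    t = min(step / total_steps, 1.0)
    raw = {lang: (1 - t) * start[lang] + t * end[lang] for lang in start}
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}  # normalize to sum to 1

# Early training is dominated by English; later in training,
# low-resource languages receive a larger share of sampled tokens.
start = {"en": 0.80, "zh": 0.10, "ar": 0.05, "th": 0.05}
end   = {"en": 0.50, "zh": 0.20, "ar": 0.15, "th": 0.15}

early = curriculum_weights(start, end, step=0, total_steps=100_000)
late  = curriculum_weights(start, end, step=100_000, total_steps=100_000)
```

A data loader would then draw each training batch's documents according to the current step's weights, so the language mixture shifts smoothly rather than abruptly.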
In addition to POLYLM, the team has developed MULTIALPACA, a multilingual instruction dataset that provides high-quality data for the supervised fine-tuning (SFT) phase. Unlike existing methods that rely on manual annotation or machine translation of an English dataset, MULTIALPACA is built with a self-instruct approach that expands English seed tasks, translates them, and filters the results for quality.
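The overall shape of such a self-instruct-style pipeline can be sketched as follows. The helper names (`generate_variants`, `translate`, `passes_quality_filter`) are placeholders standing in for an LLM-based generator, a translation step, and a quality filter; they are assumptions for illustration, not the paper's actual code.

```python
def build_multilingual_instructions(seed_tasks, target_langs,
                                    generate_variants, translate,
                                    passes_quality_filter):
    """Sketch of a self-instruct pipeline: expand English seeds into new
    instructions, translate them into target languages, and keep only
    candidates that pass a quality filter."""
    dataset = []
    for seed in seed_tasks:
        for variant in generate_variants(seed):          # 1. expand seeds
            for lang in target_langs:
                candidate = translate(variant, lang)     # 2. translate
                if passes_quality_filter(candidate, lang):  # 3. filter
                    dataset.append({"lang": lang, "instruction": candidate})
    return dataset

# Toy stand-ins so the pipeline can be exercised end to end.
def toy_generate(seed):
    return [f"{seed} (variant 1)", f"{seed} (variant 2)"]

def toy_translate(text, lang):
    return f"[{lang}] {text}"

def toy_filter(text, lang):
    return len(text) > 0

toy_data = build_multilingual_instructions(
    ["Summarize this article."], ["zh", "ar"],
    toy_generate, toy_translate, toy_filter,
)
```

In a real system, the filtering stage is what keeps automatically generated and translated instructions from degrading fine-tuning quality.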
To evaluate POLYLM’s multilingual capabilities, the researchers have created a benchmark derived from existing multilingual tasks, covering question answering, language understanding, text generation, and machine translation across 15 languages. The team shows that the pretrained model outperforms open-source models of comparable size on non-English languages, that the proposed curriculum training strategy improves multilingual performance while maintaining English proficiency, and that the multilingual instruction data further enhances POLYLM’s ability to handle multilingual zero-shot tasks.
In summary, the team’s contributions include the development of a proficient 13B scale model that performs well in major non-English languages. They have also proposed an advanced curriculum learning approach to transfer general knowledge from English to non-English languages. Additionally, they have introduced MULTIALPACA, a dataset that enhances the ability of LLMs to follow multilingual instructions.
For more information, you can check out the paper and project.
About the Author:
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning. She is a Data Science enthusiast with strong analytical and critical-thinking skills and a passion for acquiring new skills and managing work effectively.