Pangu-Σ: A Trillion-Parameter Sparse Architecture for Large Language Models

Jimmy W.

July 11, 2023

Introduction:
Large language models (LLMs) have shown strong abilities in natural language understanding, generation, and reasoning, largely because they are trained on very large text corpora. Several trillion-parameter models have been developed; this article introduces one of them, Pangu-Σ.

Scaling the Model:
Model size is a key driver of LLM performance. Sparse architectures such as Mixture of Experts (MoE) scale up the parameter count without a proportional increase in computation, because each token activates only a few experts. However, they bring problems of their own, such as imbalanced workloads across experts and global communication delays. Building a trillion-parameter sparse model that both performs well and trains efficiently remains difficult.
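To make the workload-imbalance problem concrete, here is a minimal Python sketch of top-1 expert routing with a toy gate. The function names, the Gaussian scores, and the `bias` vector are illustrative assumptions and are not taken from the Pangu-Σ work.

```python
import random

def top1_route(scores):
    """Return the index of the highest-scoring expert for one token."""
    return max(range(len(scores)), key=lambda i: scores[i])

def expert_load(num_tokens, num_experts, bias, seed=0):
    """Count how many tokens each expert receives under a toy gate.

    `bias` is a hypothetical per-expert offset standing in for a learned
    gate that has come to prefer some experts over others.
    """
    rng = random.Random(seed)
    load = [0] * num_experts
    for _ in range(num_tokens):
        # One noisy score per expert; only the winner processes the token.
        scores = [rng.gauss(0.0, 1.0) + bias[e] for e in range(num_experts)]
        load[top1_route(scores)] += 1
    return load

# Expert 0 is favoured by the gate, so it ends up with the largest share.
load = expert_load(num_tokens=10_000, num_experts=4, bias=[1.0, 0.0, 0.0, 0.0])
print(load)
```

Each token runs through one expert out of four, so per-token compute stays close to that of a single expert while the total parameter count grows with the number of experts. The biased gate sends a disproportionate share of tokens to expert 0, which is the kind of imbalance that leaves the devices hosting the other experts underused.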

Scaling the System:
Frameworks such as DeepSpeed can support training models with trillions of parameters. The main constraint is a limited compute budget, which caps the number of accelerator devices available. Techniques such as tensor parallelism, pipeline parallelism, and the Zero Redundancy Optimizer (ZeRO) help work within that limit. However, the limited bandwidth between host and device still hinders optimal performance.
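A rough calculation shows why the device count becomes the bottleneck. The sketch below assumes a generic one-trillion-parameter model, 16 bytes of training state per parameter (a common accounting for mixed-precision Adam), and a hypothetical accelerator with 32 GB of memory. None of these constants comes from the article.

```python
# Back-of-the-envelope estimate of training-state memory.
# Every constant here is an assumption chosen for illustration.
PARAMS = 10**12                   # a generic one-trillion-parameter model

# Assumed mixed-precision Adam accounting, in bytes per parameter:
# fp16 weight (2) + fp16 gradient (2) + fp32 master weight (4)
# + fp32 momentum (4) + fp32 variance (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

DEVICE_MEMORY_BYTES = 32 * 10**9  # hypothetical 32 GB accelerator

state_bytes = PARAMS * BYTES_PER_PARAM

# Ceiling division: devices needed just to hold the model states,
# before counting activations or communication buffers.
min_devices = -(-state_bytes // DEVICE_MEMORY_BYTES)

print(f"model states: {state_bytes / 1e12:.0f} TB")
print(f"devices needed for states alone: {min_devices}")
```

Under these assumptions the model states alone occupy 16 TB and need 500 devices before a single activation is stored. That is why sharding techniques, and moving some state to host memory, become attractive, and why host-to-device bandwidth then matters.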

Pangu-Σ: An Overview:
In this study, Huawei researchers introduce Pangu-Σ, a large language model with a sparse architecture and 1.085 trillion parameters. The model is built with the MindSpore framework and was trained on a cluster of 512 Ascend 910 AI accelerators over 100 days. Pangu-Σ uses the Random Routed Experts (RRE) Transformer decoder architecture, which the researchers report performs better than traditional MoE models.
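The article does not describe how RRE routes tokens. As a rough illustration only, the sketch below assumes a two-level scheme: experts are grouped by domain, and within a group each token ID maps to an expert through a fixed random table rather than a learned gate. All names, the grouping, and the sizes are hypothetical.

```python
import random

def build_routing_table(vocab_size, num_domains, experts_per_domain, seed=0):
    """Fix a random token-to-expert mapping once, before training.

    Returns table[domain][token_id] -> local expert index within that
    domain's group. Nothing in this table is learned.
    """
    rng = random.Random(seed)
    return [
        [rng.randrange(experts_per_domain) for _ in range(vocab_size)]
        for _ in range(num_domains)
    ]

def route(table, domain, token_id, experts_per_domain):
    """Two-level lookup: select the domain's group, then the token's expert."""
    local_expert = table[domain][token_id]
    return domain * experts_per_domain + local_expert

table = build_routing_table(vocab_size=1000, num_domains=4, experts_per_domain=8)
expert = route(table, domain=2, token_id=123, experts_per_domain=8)
print(expert)  # a global expert index inside domain 2's group
```

Because the mapping in this sketch is random and fixed, there is no gate to train, and tokens spread roughly evenly over the experts in a group in expectation. Whether that matches the actual design should be checked against the original paper.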

Benefits of Pangu-Σ:
The RRE architecture makes it easy to extract sub-models for different downstream applications. The researchers report strong results in conversation, translation, code generation, and general natural language understanding. The Expert Computation and Storage Separation (ECSS) mechanism keeps the training system efficient and scalable, yielding significantly higher training throughput.
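One way to picture sub-model extraction is as a filter over the parameter set. The sketch below assumes that experts are grouped by domain and that a token never leaves its domain's group, so a domain-specific model needs only the shared layers plus that group. The parameter naming scheme and the `extract_submodel` helper are hypothetical.

```python
def extract_submodel(params, domain):
    """Keep shared parameters plus the experts that belong to one domain.

    `params` maps hypothetical names to weights: "shared/..." for layers
    every token uses, "expert/<domain>/<index>/..." for expert layers.
    """
    prefix = f"expert/{domain}/"
    return {
        name: weight
        for name, weight in params.items()
        if name.startswith("shared/") or name.startswith(prefix)
    }

# Toy parameter set with two domains.
full_model = {
    "shared/attention": [0.1, 0.2],
    "expert/code/0/ffn": [0.3],
    "expert/code/1/ffn": [0.4],
    "expert/dialogue/0/ffn": [0.5],
}

code_model = extract_submodel(full_model, "code")
print(sorted(code_model))
```

The extracted dictionary drops every expert outside the chosen domain, so in this toy setup the sub-model is smaller than the full model and needs no retraining of the routing.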

Conclusion:
With its sparse architecture and 1.085 trillion parameters, Pangu-Σ is reported to outperform previous models on several downstream tasks across a range of application domains. Future research on large language models will likely continue to focus on scaling efficiency and system performance.
