Research on State Space Models (SSMs) and Transformers
In recent years, State Space Models (SSMs) and Transformers have become central architectures for sequence modeling. The open challenge has been scaling SSMs effectively: they have shown strong potential but have not yet displaced Transformers as the dominant approach.
Scaling Capabilities of SSMs
SSMs, rooted in control theory, have gained attention for blending characteristics of RNNs and CNNs. Recent breakthroughs have made it possible to scale deep SSMs to billions of parameters while maintaining computational efficiency and strong performance. Mamba, an SSM-based architecture, adds linear-time inference and a hardware-aware design that mitigates the cost of sequential recurrence. Through state compression and selective information propagation, Mamba is emerging as a promising sequence-modeling backbone across diverse domains.
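To make the idea of selective information propagation concrete, the following is a minimal, hypothetical PyTorch sketch of a selective state-space recurrence. It is not the official Mamba implementation: the real model relies on a hardware-aware parallel scan and fused kernels, while this version runs a plain sequential loop for readability, and the names (`SelectiveSSMSketch`, `d_state`, `proj_dt`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelectiveSSMSketch(nn.Module):
    """Simplified sketch of a selective state-space recurrence (Mamba-style).

    Hypothetical illustration only: the step size and the B/C projections are
    input-dependent ("selective"), while the state transition A is learned but
    fixed across time steps.
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_model, self.d_state = d_model, d_state
        # Log-parameterized state transition, one value per (channel, state) pair.
        self.A_log = nn.Parameter(torch.randn(d_model, d_state))
        # Input-dependent projections for B, C and the discretization step size.
        self.proj_B = nn.Linear(d_model, d_state)
        self.proj_C = nn.Linear(d_model, d_state)
        self.proj_dt = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        A = -torch.exp(self.A_log)                    # (d_model, d_state), stable decay
        h = x.new_zeros(batch, self.d_model, self.d_state)
        outputs = []
        for t in range(seq_len):
            xt = x[:, t]                                              # (batch, d_model)
            dt = torch.nn.functional.softplus(self.proj_dt(xt))       # per-channel step size
            B = self.proj_B(xt)                                       # (batch, d_state)
            C = self.proj_C(xt)                                       # (batch, d_state)
            # Discretize and update the hidden state (zero-order-hold style).
            dA = torch.exp(dt.unsqueeze(-1) * A)                      # (batch, d_model, d_state)
            dB = dt.unsqueeze(-1) * B.unsqueeze(1)                    # (batch, d_model, d_state)
            h = dA * h + dB * xt.unsqueeze(-1)
            # Read out by contracting the state dimension with the input-dependent C.
            outputs.append((h * C.unsqueeze(1)).sum(-1))
        return torch.stack(outputs, dim=1)                            # (batch, seq_len, d_model)
```

Because A, B, and the step size depend on the current input, the state can keep or discard information token by token, which is the intuition behind state compression with selective propagation.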
Combining MoE with SSMs
A team of researchers has proposed combining a Mixture of Experts (MoE) layer with SSMs to improve scaling. The resulting model, MoE-Mamba, outperforms both Mamba and a Transformer-MoE baseline, reaching the same performance as vanilla Mamba in fewer training steps while preserving Mamba's inference-time gains over the Transformer. The study also explores enhancements to the Mamba architecture and conditional computation within Mamba's block design, anticipating more efficient scaling to even larger language models.
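As an illustration of how such an integration might look, below is a hedged PyTorch sketch that interleaves a sequence-mixing layer (standing in for a Mamba block) with a Switch-style top-1 mixture-of-experts feed-forward layer. The module names (`SwitchMoE`, `MoEMambaBlock`), the expert count, and the routing details are assumptions for illustration; capacity limits, load-balancing losses, and the paper's exact block layout are omitted.

```python
import torch
import torch.nn as nn

class SwitchMoE(nn.Module):
    """Top-1 (Switch-style) mixture-of-experts feed-forward layer.

    Hypothetical sketch: each token is routed to a single expert MLP,
    scaled by its router gate value.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); route each token to its best expert.
        gates = self.router(x).softmax(dim=-1)        # (batch, seq, num_experts)
        top_gate, top_idx = gates.max(dim=-1)         # winning expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                       # tokens routed to expert e
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


class MoEMambaBlock(nn.Module):
    """Interleaves a sequence-mixing (Mamba-style) layer with a sparse MoE layer.

    `mixer` stands in for a Mamba block; any module mapping
    (batch, seq, d_model) -> (batch, seq, d_model) works for this sketch.
    """

    def __init__(self, d_model: int, mixer: nn.Module, d_ff: int = 2048, num_experts: int = 8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = mixer
        self.moe = SwitchMoE(d_model, d_ff, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))   # token mixing along the sequence
        x = x + self.moe(self.norm2(x))     # conditional (per-token) computation
        return x
```

The point of the conditional-computation layer is that each token activates only one expert, so the parameter count can grow with the number of experts while the per-token compute stays roughly constant.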