Transformer models are deployed in settings ranging from powerful multi-accelerator clusters to individual mobile devices. To meet the varied requirements of inference, developers train foundation models such as PaLM 2, Llama, and ViT in several sizes. However, training each size independently is expensive, which limits how many model sizes a family can support.
Large foundation models serve very different deployment scenarios, from latency-sensitive responses on mobile phones to batched inference on GPU clusters backing large-scale web applications. To accommodate these circumstances, model families ship several independently trained sizes, typically spaced roughly linearly on a logarithmic scale.
A group of researchers from Google Research, the University of Texas at Austin, the University of Washington, and Harvard University has introduced MatFormer, a Transformer architecture explicitly designed for elasticity: a single trained model from which many smaller, accurate submodels can be extracted without additional training.
MatFormer incorporates a nested sub-structure within the standard Transformer and jointly optimizes all the granularities to produce a single, universal elastic model. By deliberately mixing different granularities across the layers of a trained MatFormer, accurate submodels can be produced without incurring any extra training cost, effectively tuning the model's capacity layer by layer.
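As a rough illustration of this joint optimization, the following PyTorch-style sketch (not the authors' code; the `granularity` forward argument and the `matformer_step` helper are hypothetical) trains all nested submodels on the same batch by averaging their losses in a single optimizer step:

```python
import torch

def matformer_step(model, inputs, targets, loss_fn, optimizer, num_granularities=4):
    """One training step that jointly optimizes every nested granularity."""
    optimizer.zero_grad()
    total_loss = 0.0
    for g in range(num_granularities):         # smallest to largest submodel
        logits = model(inputs, granularity=g)  # hypothetical forward signature
        total_loss = total_loss + loss_fn(logits, targets)
    loss = total_loss / num_granularities      # average across granularities
    loss.backward()                            # shared weights receive all gradients
    optimizer.step()
    return loss.item()
```

Because every granularity shares the same underlying parameters, the gradient of each submodel's loss flows into the same weight matrices, which is what lets a single set of weights serve all model sizes.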
The nesting is applied to the hidden representations of the Feed Forward Network (FFN) block, with attention heads ordered from most to least significant so that the most important heads are shared by more submodels. This design speeds up training by 15% and allows smaller submodels to be extracted while maintaining accuracy.
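A minimal sketch of the nested FFN idea follows, assuming illustrative hidden widths such as [512, 1024, 2048, 4096] (the class name, widths, and activation are assumptions, not the released implementation): each granularity uses only the first m hidden neurons of a shared pair of weight matrices, so a smaller submodel is literally a slice of the largest one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    """FFN block whose hidden dimension is nested across granularities."""

    def __init__(self, d_model: int, hidden_dims: list):
        super().__init__()
        self.hidden_dims = sorted(hidden_dims)     # e.g. [512, 1024, 2048, 4096]
        d_ff_max = self.hidden_dims[-1]
        self.w_in = nn.Linear(d_model, d_ff_max)   # shared up-projection
        self.w_out = nn.Linear(d_ff_max, d_model)  # shared down-projection

    def forward(self, x: torch.Tensor, granularity: int) -> torch.Tensor:
        m = self.hidden_dims[granularity]          # hidden width for this submodel
        # Slice the shared weights: rows of w_in and columns of w_out.
        h = F.gelu(F.linear(x, self.w_in.weight[:m], self.w_in.bias[:m]))
        return F.linear(h, self.w_out.weight[:, :m], self.w_out.bias)
```

Since all granularities share the same parameters, extracting the g-th submodel amounts to fixing `granularity=g` at inference time; no weights need to be copied or retrained.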
By selecting a different granularity for each MatFormer layer, the researchers could generate a substantial number of accurate smaller models without any further optimization. They studied MatFormer's effectiveness across model classes, modalities, and scales, finding validation loss and one-shot downstream performance comparable to independently trained counterparts, which makes it a robust and reliable architecture for a wide range of AI tasks.
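The sketch below (with hypothetical helper names) illustrates this mix-and-match style of extraction: once training is done, a new submodel is specified simply by picking one granularity index per layer, yielding a combinatorially large family of models at no extra training cost.

```python
import itertools

def mix_n_match_configs(num_layers: int, num_granularities: int):
    """Enumerate per-layer granularity assignments, e.g. (0, 2, 1, 3, ...)."""
    return itertools.product(range(num_granularities), repeat=num_layers)

def forward_with_config(blocks, x, config):
    """Run a stack of nested Transformer blocks, one granularity per layer."""
    for block, g in zip(blocks, config):
        x = block(x, granularity=g)   # each block slices its own FFN width
    return x
```

With, say, 4 granularities and 24 layers this already gives 4^24 possible configurations, so in practice one would evaluate only a handful of them along the desired compute budget.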