Introducing 4M: A Multimodal Training Scheme for Versatile Computer Vision Models
4M is a new multimodal training scheme for computer vision. It trains a single unified Transformer encoder-decoder across a wide range of input and output modalities: text, images, geometric and semantic modalities, and neural network feature maps. Training on this diverse set of modalities gives the resulting models three key capabilities: they can perform a variety of vision tasks, they excel when fine-tuned for new tasks, and they function as generative models that can be conditioned on different modalities, which enables expressive editing.
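To make the idea of one model consuming and producing many modalities more concrete, here is a minimal sketch of how a single training example might be assembled. It assumes every modality has already been converted to a sequence of discrete tokens, and that the encoder sees one random subset of those tokens while the decoder is asked to predict another, disjoint subset. The text above does not describe the tokenizers or the sampling procedure, so the modality names, token IDs, and the function `sample_input_target` are hypothetical illustrations only.

```python
import random


def sample_input_target(tokens_by_modality, num_input, num_target, rng):
    """Split a multimodal token pool into disjoint input and target subsets.

    Hypothetical helper: illustrates the general idea only, not 4M's
    actual sampling procedure.
    """
    # Flatten all modalities into one pool of (modality, position, token)
    # triples so that a single model can treat them uniformly.
    pool = [
        (modality, position, token)
        for modality, tokens in tokens_by_modality.items()
        for position, token in enumerate(tokens)
    ]
    if num_input + num_target > len(pool):
        raise ValueError("not enough tokens for the requested subset sizes")

    # Sampling without replacement guarantees inputs and targets are disjoint.
    chosen = rng.sample(pool, num_input + num_target)
    return chosen[:num_input], chosen[num_input:]


# Toy example with made-up token IDs for three assumed modalities.
example = {
    "rgb": [101, 102, 103, 104],
    "depth": [7, 8, 9, 10],
    "caption": [42, 43],
}
inputs, targets = sample_input_target(example, 4, 3, random.Random(0))
```

Because inputs and targets are drawn from the same shared pool, any modality can end up on either side of the model. That is one way a single encoder-decoder could learn to map between arbitrary modality combinations, which in turn would support conditioning generation on whichever modalities happen to be available.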
The significance of this approach lies in its potential to yield versatile, scalable foundation models for vision. Such models could serve a wide range of applications, and 4M sets the stage for further exploration of multimodal learning in vision and other domains.