Training large language models (LLMs) that can handle a wide range of tasks without extensive task-specific adjustments has become standard practice in natural language processing (NLP). Vision, however, still lacks equally flexible and scalable models, and the success of LLMs has motivated researchers to pursue comparable foundation models for visual tasks. Because vision models must cope with heterogeneous inputs, one promising direction is to train a single model on images, 3D data, and text together.

To build such models, researchers point to three key factors: the data, the architecture, and the training objective. A model should be able to accept additional input modalities, scale in size, and handle those diverse inputs effectively.

A recent study by the Swiss Federal Institute of Technology Lausanne (EPFL) and Apple tackles exactly this: building a scalable model that handles different input types effectively. The researchers' strategy is to train a single unified Transformer encoder-decoder with a multimodal masked modeling objective, designed to extend to many diverse modalities. Through this approach, they aim to build models that work across a wide array of tasks and inputs. They find that the resulting 4M model handles diverse formats well, allowing a single Transformer to be trained on text, bounding boxes, images, and neural network features. By masking both inputs and targets, the model trains efficiently even though it operates over a large collection of modalities: only a subset of tokens is fed to the encoder, and only a subset is predicted by the decoder. This also allows training on diverse, large-scale datasets without requiring multimodal/multitask annotations.

The researchers believe 4M holds great promise for many vision tasks and future developments. You can find the full research paper and project on their website.
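To make the input/target masking idea concrete, here is a minimal, hedged sketch in plain Python. It is illustrative only, not the authors' implementation: it assumes each modality has already been converted into a sequence of discrete tokens (as 4M does via tokenization), and it shows how a small random subset of tokens across all modalities can serve as encoder inputs while a disjoint subset serves as decoder targets, keeping the per-step compute fixed regardless of how many modalities are present. All names (`sample_masks`, the example modalities and counts) are invented for illustration.

```python
import random

def sample_masks(tokens_per_modality, n_inputs, n_targets, seed=0):
    """Illustrative 4M-style input/target masking (not the official code).

    Every modality is assumed to be pre-tokenized. A random subset of
    (modality, position) pairs becomes the encoder input; a disjoint
    random subset becomes the decoder's prediction target.
    """
    rng = random.Random(seed)
    # Flatten all (modality, position) pairs into one candidate pool.
    pool = [(name, i)
            for name, toks in tokens_per_modality.items()
            for i in range(len(toks))]
    rng.shuffle(pool)
    inputs = pool[:n_inputs]                        # visible to the encoder
    targets = pool[n_inputs:n_inputs + n_targets]   # predicted by the decoder
    return inputs, targets

# Example: three hypothetical modalities with different sequence lengths.
modalities = {
    "rgb": list(range(16)),      # e.g. 16 image-patch tokens
    "caption": list(range(8)),   # e.g. 8 text tokens
    "bbox": list(range(4)),      # e.g. 4 bounding-box tokens
}
inp, tgt = sample_masks(modalities, n_inputs=6, n_targets=4)
```

The key property is that `n_inputs` and `n_targets` are fixed budgets, so training cost does not grow with the number of modalities; the mixing of modalities in one pool is what lets a single Transformer learn cross-modal prediction.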