Improving Vision and Language Models with Synthetic Data
Powerful machine-learning algorithms known as vision and language models have shown great promise at tasks like generating captions and summarizing videos. However, these models struggle to understand concepts that go beyond identifying objects. For example, they might recognize a cup and a table in an image but fail to understand that the cup is sitting on the table.
To address this shortcoming, researchers from MIT, the MIT-IBM Watson AI Lab, and other institutions have developed a new technique using computer-generated data. They created a synthetic dataset of images that depict various scenarios, object arrangements, and human actions, along with detailed text descriptions. This dataset helps vision and language models learn concepts more effectively and still make accurate predictions when presented with real images.
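The article does not specify the dataset's exact format, but conceptually each example pairs a rendered image with a caption and structured annotations covering objects, attributes, relations, and actions. A minimal sketch of what one such record might look like (all field names here are hypothetical, not the researchers' actual schema):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record for one synthetic example; the fields are
# illustrative stand-ins for the kinds of annotations described above.
@dataclass
class SyntheticExample:
    image_path: str                                      # rendered frame from a 3D scene
    caption: str                                         # detailed text description
    objects: List[str] = field(default_factory=list)     # e.g. ["cup", "table"]
    attributes: List[str] = field(default_factory=list)  # e.g. ["red cup"]
    relations: List[str] = field(default_factory=list)   # e.g. ["cup on table"]
    action: str = ""                                     # avatar action in the scene

example = SyntheticExample(
    image_path="scenes/kitchen_0042.png",
    caption="A red cup sits on the wooden table while a person reaches for it.",
    objects=["cup", "table", "person"],
    attributes=["red cup", "wooden table"],
    relations=["cup on table"],
    action="person reaches for cup",
)
```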
By testing the models on concept understanding, the researchers found that their technique improved accuracy by up to 10 percent. This advancement has potential applications in fields like e-commerce and healthcare, where systems automatically caption videos or provide natural language answers to image-related questions.
Vision and language models are typically trained to identify objects in a scene but often overlook object attributes and positional relationships. The researchers used contrastive learning to fine-tune these models with synthetic data: they built diverse 3D environments populated with objects, added human avatars that interact with those objects, and rendered realistic images paired with detailed captions.
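Contrastive learning pulls the embeddings of matched image and caption pairs together while pushing mismatched pairs apart. The article gives no implementation details, so the following is only a minimal PyTorch sketch of a standard CLIP-style symmetric contrastive (InfoNCE) loss, with random tensors standing in for the outputs of the pretrained vision and text encoders:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-caption pairs.

    Row i of each tensor is a matched pair; every other pairing in the
    batch serves as a negative example.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> caption direction
    loss_t2i = F.cross_entropy(logits.t(), targets)      # caption -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random embeddings stand in for real encoder outputs.
batch, dim = 8, 512
image_emb = torch.randn(batch, dim)  # placeholder vision-encoder output
text_emb = torch.randn(batch, dim)   # placeholder text-encoder output
print(contrastive_loss(image_emb, text_emb).item())
```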
Synthetic data offers several advantages: it is diverse, inexpensive to generate, photorealistic, and free of privacy concerns. And because generation is automatic, the data can be produced at scale.
Fine-tuning on synthetic data carries the risk that a model will forget what it learned during pretraining. The researchers counteracted this in two ways: by designing the synthetic data to closely resemble real images, and by adjusting the model's weights after fine-tuning.
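The article does not name the specific post-fine-tuning adjustment. One widely used recipe for preserving pretrained knowledge is to interpolate between the pretrained and fine-tuned weights (as in WiSE-FT); the sketch below assumes that approach rather than reproducing the researchers' exact method:

```python
import torch

def interpolate_weights(pretrained_state: dict,
                        finetuned_state: dict,
                        alpha: float = 0.5) -> dict:
    """Blend pretrained and fine-tuned weights parameter by parameter.

    alpha = 0 keeps the original model (no forgetting, no gain);
    alpha = 1 keeps the fully fine-tuned model. Intermediate values
    often retain much of the fine-tuning benefit while preserving
    pretrained knowledge. Non-floating-point entries (e.g., integer
    buffers) are taken from the fine-tuned model unchanged.
    """
    return {
        name: ((1 - alpha) * pretrained_state[name] + alpha * finetuned_state[name])
              if pretrained_state[name].is_floating_point()
              else finetuned_state[name]
        for name in pretrained_state
    }

# Usage sketch (model names are hypothetical):
# model.load_state_dict(interpolate_weights(
#     pretrained_model.state_dict(), finetuned_model.state_dict(), alpha=0.5))
```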
With this synthetic dataset and fine-tuning strategy, the researchers significantly improved the ability of vision and language models to recognize concepts accurately without forgetting what they had already learned.
Moving forward, the researchers aim to enhance the visual quality and diversity of synthetic data and investigate the scalability limits of model improvement with larger and more diverse datasets.
This research is funded by the U.S. Defense Advanced Research Projects Agency, the National Science Foundation, and the MIT-IBM Watson AI Lab.