Probing Language and Vision Models for Verb Understanding
Artificial intelligence (AI) models must connect language to the visual world, a capability that is crucial for real-world applications such as generating image descriptions for visually impaired people. To do this, a model must understand different aspects of language, such as objects and verbs, and relate them to images.
Multimodal transformer models have been successful on a wide range of language and vision tasks. However, it is unclear whether these models genuinely understand language and images together: prior research shows that models can score well on benchmarks without actually grasping the connections between the two.
To address this, a team introduced the SVO-Probes dataset to evaluate how well models understand verbs. The dataset contains 48,000 image-sentence pairs and tests understanding of over 400 verbs. Each sentence is summarized as a subject-verb-object (SVO) triple, and positive examples are contrasted with negative ones in which exactly one element of the triple is changed. Because only one part changes at a time, the dataset can pinpoint which part of a sentence models find most challenging.
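The idea of a negative example that changes exactly one element of the subject-verb-object triple can be sketched as follows. This is a minimal illustration, not the dataset's actual schema; the class and field names are hypothetical.

```python
from dataclasses import dataclass, replace

# Hypothetical representation of an SVO triple; field names are
# illustrative and not taken from the SVO-Probes release.
@dataclass(frozen=True)
class SVOTriplet:
    subject: str
    verb: str
    obj: str

def make_negative(triplet: SVOTriplet, field: str, new_value: str) -> SVOTriplet:
    """Build a negative probe by changing exactly one element of the triple."""
    return replace(triplet, **{field: new_value})

positive = SVOTriplet(subject="woman", verb="sits", obj="bench")

# A verb-negative keeps subject and object fixed, so any model failure
# on this probe can be attributed to verb understanding alone.
verb_negative = make_negative(positive, "verb", "stands")
```

Holding the other two elements fixed is what makes the probe diagnostic: if a model accepts the verb-negative image, the error can only come from the verb.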
In creating SVO-Probes, the team used an image search to find candidate pairs from a training dataset, filtered the results for accuracy, and then paired each sentence with an image. They then tested how well multimodal transformers classify each image-sentence pair as a match or a mismatch, finding that recognizing verbs is especially challenging for these models.
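The evaluation described above reduces to binary match/mismatch classification, with accuracy broken down by which SVO element was perturbed. A minimal sketch of that breakdown, with hypothetical data and function names:

```python
from collections import defaultdict

def accuracy_by_perturbation(examples):
    """examples: iterable of (perturbed_element, predicted_match, true_match).

    Groups binary match/mismatch predictions by which SVO element was
    changed, so verb errors can be compared with subject and object errors.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for element, pred, true in examples:
        total[element] += 1
        correct[element] += int(pred == true)
    return {element: correct[element] / total[element] for element in total}

# Toy predictions for illustration only; the paper's actual finding was
# that verb negatives yield the lowest accuracy.
results = accuracy_by_perturbation([
    ("subject", True, True),
    ("verb", True, False),
    ("verb", False, False),
    ("object", True, True),
])
```

Splitting accuracy this way is what lets the benchmark attribute failures to verbs specifically rather than to image-sentence matching in general.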
Surprisingly, models with weaker image modeling performed better on the dataset, which suggests that the standard models may overfit their training data. Despite strong performance on other benchmarks, these models appear to struggle with fine-grained verb understanding.
The team hopes that SVO-Probes will encourage further exploration of verb understanding in AI models. They’ve made the benchmark and models available on GitHub for other researchers to use.