Grounding language to vision is a crucial problem for many real-world AI systems: for example, retrieving images from text queries or generating descriptions for visually impaired users. To accomplish these tasks, models must connect different aspects of language, such as objects and verbs, to images. For instance, a model must distinguish between verbs like “catch” and “kick” to tell two otherwise similar images apart.
Understanding verbs is particularly challenging because it requires recognizing objects and understanding how they relate to each other within an image. To address this difficulty, the SVO-Probes dataset has been introduced. This dataset allows researchers to probe language and vision models specifically for verb understanding.
The SVO-Probes dataset contains 48,000 image-sentence pairs testing understanding of more than 400 verbs. Each sentence can be broken down into a Subject, Verb, Object (SVO) triplet and is paired with both positive and negative example images. A negative image differs from the positive one in exactly one element of the triplet: the subject, the verb, or the object. By isolating specific parts of the sentence in this way, researchers can identify which aspects pose the most difficulty for models. This also makes SVO-Probes more challenging than standard image retrieval tasks, where negative examples often bear no relation to the query sentence.
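The pairing structure described above can be sketched as a small data record. This is an illustrative sketch only: the field names and the `differing_element` helper are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SVOExample:
    """One image paired with a sentence keyed by its (subject, verb, object) triplet.

    Field names are illustrative, not the dataset's real schema.
    """
    subject: str
    verb: str
    obj: str
    image_url: str
    is_positive: bool  # True if the image matches the sentence

def differing_element(pos: SVOExample, neg: SVOExample) -> str:
    """Return which single SVO element the negative image changes."""
    diffs = [name for name, a, b in [
        ("subject", pos.subject, neg.subject),
        ("verb", pos.verb, neg.verb),
        ("object", pos.obj, neg.obj),
    ] if a != b]
    # An SVO-Probes negative differs from its positive in exactly one element.
    assert len(diffs) == 1
    return diffs[0]

pos = SVOExample("animal", "lie", "grass", "img/pos.jpg", True)
neg = SVOExample("animal", "stand", "grass", "img/neg.jpg", False)
print(differing_element(pos, neg))  # -> verb
```

Because only one element changes, a model's error on this pair can be attributed specifically to verb understanding rather than object or subject recognition.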
Creating the SVO-Probes dataset involves querying an image search engine with SVO triplets drawn from the Conceptual Captions training set. The retrieved images are filtered to yield a clean set of image-SVO pairs. To probe models with full sentences, annotators then write a sentence describing each image using its SVO triplet. Finally, these sentences are paired with negative images, and annotators verify that each negative truly mismatches its sentence.
The performance of multimodal transformers on the SVO-Probes dataset is analyzed. Results show that verb recognition is the hardest case: overall accuracy is 60.8% on verb negatives, compared to 67.0% on subject negatives and 73.4% on object negatives. Surprisingly, models with weaker image modeling outperform the standard transformer model on this dataset. This suggests that the standard model may overfit the training set, and it highlights weaknesses that are not observed on other benchmarks.
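The per-category breakdown above amounts to grouping each model judgment by which SVO element its negative probes and averaging correctness within each group. A minimal sketch follows; the records and field names are made up for illustration and are not the benchmark's actual output format.

```python
# Hypothetical per-pair model judgments: "probe" names the SVO element the
# negative image changes, "correct" records whether the model judged the
# image-sentence match correctly. All values here are invented examples.
results = [
    {"probe": "verb", "correct": True},
    {"probe": "verb", "correct": False},
    {"probe": "subject", "correct": True},
    {"probe": "object", "correct": True},
    {"probe": "object", "correct": False},
]

def accuracy_by_probe(results):
    """Compute accuracy separately for subject-, verb-, and object-probing pairs."""
    totals, hits = {}, {}
    for r in results:
        totals[r["probe"]] = totals.get(r["probe"], 0) + 1
        hits[r["probe"]] = hits.get(r["probe"], 0) + int(r["correct"])
    return {probe: hits[probe] / totals[probe] for probe in totals}

print(accuracy_by_probe(results))
# -> {'verb': 0.5, 'subject': 1.0, 'object': 0.5}
```

Reporting accuracy per probe type, rather than one aggregate number, is what lets the benchmark show that verbs lag behind subjects and objects.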
In conclusion, multimodal transformers still face difficulties in fine-grained understanding, especially when it comes to verbs. The SVO-Probes dataset aims to drive exploration and improvement in verb understanding for language and vision models. Researchers can access the SVO-Probes benchmark and models on GitHub for further investigation.