Introducing RO-ViT: Improving Open-Vocabulary Object Detection with Vision Transformers
Detecting objects in images is important for AI applications such as autonomous agents and shopping systems. But current object detectors are limited: their training data cannot cover every object category they might encounter. That’s where the open-vocabulary detection (OVD) task comes in. It trains detectors on image-text pairs, allowing them to predict a wide range of objects beyond the annotated training vocabulary.
Researchers at Google have been exploring the use of vision transformers (ViTs) for open-vocabulary detection. Existing methods rely on pre-trained vision-language models (VLMs), but these models are trained with image-level objectives and don’t develop the region-level notion of objects that detection requires. So the researchers developed a new approach called RO-ViT: Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers.
The key idea behind RO-ViT is to incorporate locality information into the image-text pre-training. They do this with a new positional embedding scheme called “cropped positional embedding” (CPE). Instead of feeding the model the full-image positional embeddings, they randomly crop a region of those embeddings and resize it back to the full token grid during pre-training. This aligns the pre-training better with the region-level recognition required for open-vocabulary detection, where the detector reasons about crops rather than whole images.
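The crop-and-resize step can be sketched in a few lines. This is a minimal illustration, not the paper’s code; the function name, the (1, H, W, C) embedding layout, and the bilinear resize are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_embed, crop_size, out_size):
    """Sketch of Cropped Positional Embedding (CPE).

    pos_embed: (1, H, W, C) full-image positional embeddings.
    Randomly crops a (crop_size x crop_size) region and resizes it
    back to (out_size x out_size), so image-level pre-training sees
    the kind of region crop the detector will see later.
    """
    _, H, W, C = pos_embed.shape
    top = torch.randint(0, H - crop_size + 1, (1,)).item()
    left = torch.randint(0, W - crop_size + 1, (1,)).item()
    crop = pos_embed[:, top:top + crop_size, left:left + crop_size, :]
    # Resize the cropped embeddings back to the ViT's token grid.
    crop = crop.permute(0, 3, 1, 2)                       # (1, C, h, w)
    crop = F.interpolate(crop, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    return crop.permute(0, 2, 3, 1)                       # (1, out, out, C)

pe = torch.randn(1, 14, 14, 768)       # e.g. a 14x14 token grid
cpe = cropped_positional_embedding(pe, crop_size=7, out_size=14)
```

At detection time the full-image positional embeddings are used as usual; the random cropping happens only during pre-training.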
They also replace the softmax cross-entropy loss with focal loss in the contrastive image-text learning. Focal loss down-weights easy examples, giving finer control over how hard examples are weighted and letting the model learn more from the challenging ones. Together these changes improve the performance of the open-vocabulary detector.
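One way to realize this idea is to treat each entry of the batch’s image-text similarity matrix as a binary matched/unmatched decision and apply the standard focal down-weighting. The sketch below assumes this sigmoid-based formulation and the usual focal-loss hyperparameter names; the paper’s exact loss and settings may differ.

```python
import torch
import torch.nn.functional as F

def focal_contrastive_loss(logits, gamma=2.0, alpha=0.25):
    """Sketch: focal loss over an image-text similarity matrix.

    logits: (B, B) similarity logits for a batch; the diagonal holds
    the matched image-text pairs. Each entry is scored as a binary
    matched/unmatched classification, with easy examples down-weighted
    by the focal term (1 - p_t)**gamma.
    """
    B = logits.shape[0]
    targets = torch.eye(B, device=logits.device)          # matches on diagonal
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)           # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_contrastive_loss(torch.randn(4, 4))
```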
To further enhance open-vocabulary detection, the researchers leverage recent advances in novel object proposals. They use centerness scores instead of binary objectness classification scores during the proposal stage. Because a binary classifier tends to label anything outside the training categories as background, the centerness score helps the model keep proposals for novel objects instead of suppressing them.
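A centerness score of this kind (as used in FCOS-style and OLN-style proposal networks) can be computed from a location’s distances to the four box edges. This is a generic sketch of that formulation, not RO-ViT’s specific implementation.

```python
import torch

def centerness(l, t, r, b):
    """Sketch of a centerness score for object proposals.

    l, t, r, b: distances from a location inside a box to its
    left/top/right/bottom edges. Returns a score in (0, 1] that is
    1 at the exact box center and decays toward the edges; using it
    in place of a learned binary objectness classifier avoids
    suppressing novel objects as background.
    """
    return torch.sqrt(
        (torch.minimum(l, r) / torch.maximum(l, r))
        * (torch.minimum(t, b) / torch.maximum(t, b))
    )

# A location at the exact center of a box scores 1.0.
center = centerness(torch.tensor(2.0), torch.tensor(2.0),
                    torch.tensor(2.0), torch.tensor(2.0))
# An off-center location scores lower.
off = centerness(torch.tensor(1.0), torch.tensor(3.0),
                 torch.tensor(2.0), torch.tensor(2.0))
```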
In evaluations, RO-ViT outperformed existing approaches on the LVIS open-vocabulary detection benchmark, achieving higher average precision on rare categories and higher mask AP than other ViT-based and CNN-based models. RO-ViT also showed improvements in image-text retrieval, outperforming other models on the MS-COCO and Flickr30K benchmarks.
In conclusion, RO-ViT is a simple but effective method for improving open-vocabulary object detection with vision transformers. By incorporating region-aware pre-training, using cropped positional embeddings, and leveraging novel object proposals, RO-ViT achieves better performance on various benchmarks. This research demonstrates the potential of vision transformers for building advanced object detection models. To learn more about RO-ViT and access the code, check out the research paper “RO-ViT: Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers” presented at CVPR 2023.