Since prehistoric times, people have used sketches for communication and documentation. Over the past decade, researchers have made significant progress in understanding what sketches can do, including classification, synthesis, visual abstraction, style transfer, and continuous stroke fitting. Their expressive potential, however, has so far been explored mainly in the context of image retrieval.
Recently, researchers have trained systems to harness the evocative power of sketches for object detection in scenes. This is significant because it allows a sketch to pinpoint a specific object in a scene, such as one particular zebra in a herd.
The researchers set two goals for their model: it should work without any prior knowledge of the expected results (zero-shot), and without requiring extra bounding boxes or class labels (weakly supervised). This combination makes their approach novel.
To achieve this, the researchers reformulate object detection from a closed-set to an open-vocabulary configuration. They replace classification heads with prototype learning, with encoded sketch features serving as the support set. The model is then trained with a multi-category cross-entropy loss over all possible categories or instances, in a weakly supervised object detection setting.
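The prototype idea above can be sketched numerically. This is a minimal, hypothetical illustration (the function names, feature dimensions, and the cosine-similarity scoring are assumptions, not the paper's exact formulation): candidate-region features are scored against sketch-derived class prototypes, and a multi-category cross-entropy is computed over those scores.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_scores(region_feats, sketch_prototypes):
    # Cosine similarity between each region feature (num_regions x D)
    # and each sketch-derived class prototype (num_classes x D).
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = sketch_prototypes / np.linalg.norm(sketch_prototypes, axis=1, keepdims=True)
    return r @ p.T  # (num_regions, num_classes)

def multi_category_cross_entropy(scores, labels):
    # Cross-entropy over prototype similarity scores, treating each
    # prototype as one category.
    probs = softmax(scores, axis=1)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-9))

rng = np.random.default_rng(0)
region_feats = rng.normal(size=(5, 128))       # stand-in encoded region proposals
sketch_prototypes = rng.normal(size=(3, 128))  # stand-in encoded sketch support set
scores = prototype_scores(region_feats, sketch_prototypes)
loss = multi_category_cross_entropy(scores, np.array([0, 1, 2, 0, 1]))
```

Because the classifier is just a similarity lookup against the support set, swapping in sketch features for a new category requires no retraining of a classification head, which is what makes the open-vocabulary setting possible.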
While object detection operates on an image level, sketch-based image retrieval (SBIR) is trained using pairs of sketches and photos of individual objects. To bridge the gap between object-level and image-level characteristics, the researchers have developed a training paradigm.
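One way to picture the bridge between object-level SBIR training and image-level detection is to rank candidate regions of a scene against a single sketch query in a shared embedding space. The snippet below is a toy sketch under that assumption; the function name and 3-D embeddings are hypothetical, chosen only to make the ranking visible.

```python
import numpy as np

def localise_with_sketch(region_embeds, sketch_embed):
    # Rank candidate-region embeddings against one sketch-query embedding
    # by cosine similarity; the top-ranked region is the detection.
    r = region_embeds / np.linalg.norm(region_embeds, axis=1, keepdims=True)
    s = sketch_embed / np.linalg.norm(sketch_embed)
    sims = r @ s
    order = np.argsort(-sims)  # indices from best to worst match
    return order, sims

# Toy example: three region embeddings; the third is closest in
# direction to the sketch query, so it should rank first.
regions = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0]])
query = np.array([0.9, 0.12, 0.0])
order, sims = localise_with_sketch(regions, query)
```

The same pairwise similarity that SBIR learns on isolated sketch-photo pairs thus drives localisation once the image is decomposed into candidate regions.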
The work makes several contributions. It harnesses the expressiveness of human sketching for object detection, building a detector that understands what a sketch is trying to convey. The detector handles traditional category-level detection as well as instance- and part-level detection. The authors also devise a novel prompt learning configuration that combines CLIP and SBIR to build a sketch-aware detector that operates in a zero-shot fashion without bounding box annotations or class labels.
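The prompt learning ingredient can be illustrated in miniature. In standard prompt learning, a frozen pretrained encoder is kept fixed and only a small set of learnable context vectors, prepended to the input tokens, is trained. The snippet below is a hypothetical stand-in (the tiny encoder, vocabulary size, and dimensions are all assumptions, not CLIP's real architecture) showing that structure:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, n_ctx = 100, 64, 4

# Stand-ins for frozen pretrained components (illustrative only):
token_embeds = rng.normal(size=(vocab_size, embed_dim))  # token-embedding table

def frozen_text_encoder(seq):
    # Placeholder for a frozen transformer text encoder: parameters
    # here are never updated during prompt learning.
    return np.tanh(seq).mean(axis=0)

# Learnable context ("prompt") vectors: the only trainable parameters.
ctx = rng.normal(size=(n_ctx, embed_dim)) * 0.02

def encode_with_prompt(class_token_ids):
    # Prepend the learned context vectors to the class tokens, then
    # pass the whole sequence through the frozen encoder.
    seq = np.concatenate([ctx, token_embeds[class_token_ids]], axis=0)
    return frozen_text_encoder(seq)

emb = encode_with_prompt(np.array([7, 42]))
```

Training only the prompt vectors keeps the pretrained model's knowledge intact while adapting it to the sketch-conditioned detection task.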
On widely used object detection datasets, their model outperforms both supervised and weakly supervised object detectors in zero-shot setups.
In conclusion, the researchers have successfully brought the expressive nature of sketches into object detection. Their framework enables more accurate and nuanced detection of objects in a scene, and the combination of CLIP and SBIR proves a powerful approach.