Posted by Kuniaki Saito, Student Researcher, Google Research, Cloud AI Team, and Kihyuk Sohn, Research Scientist, Google Research
Image retrieval is a core capability of search engines. Users typically search for a target image using either an image or text as a query, but text-based retrieval is limited in how precisely it can describe an image. For example, when searching for a fashion item, a user may want an item with a specific attribute, such as the color of a logo, that is hard to put into words. To address this, we introduce composed image retrieval (CIR), which combines an image and text in a single query to retrieve target images more accurately.
However, CIR methods require large amounts of labeled triplet data (a reference image, a text modification, and a target image), which is expensive to collect. Models trained on such data are also often limited to a specific use case and generalize poorly to other datasets. To overcome these challenges, we propose zero-shot CIR (ZS-CIR), in which we aim to build a single CIR model that performs a variety of tasks without requiring labeled triplet data. Instead, we train the model on large-scale image-caption pairs and unlabeled images, which are much easier to collect. We have also released the code to encourage further advancement in this field.
Our method, called Pic2Word, leverages the language capabilities of the contrastive language-image pre-trained model (CLIP) to generate meaningful language embeddings. We train a lightweight mapping network that maps the embedding of the input image, produced by CLIP's visual encoder, to a word token in the textual input space. The network is optimized with a vision-language contrastive loss so that the text embedding of a prompt containing this token stays closely aligned with the visual embedding. By treating the query image as a word, we can flexibly compose image features and text descriptions using the language encoder.
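The training setup above can be sketched in a few lines of numpy. This is a toy illustration, not the released implementation: the frozen CLIP encoders are replaced by random stand-in embeddings, the mapping network is a hypothetical two-layer MLP, and the loss is a standard symmetric contrastive (InfoNCE) objective over a small batch.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # hypothetical CLIP embedding size

def mapping_network(image_emb, W1, b1, W2, b2):
    """Lightweight MLP (hypothetical 2-layer sketch) mapping a CLIP image
    embedding to a pseudo word token s in the text input space."""
    h = np.maximum(image_emb @ W1 + b1, 0.0)  # ReLU
    return h @ W2 + b2

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss aligning the text embedding of a prompt like
    "a photo of [s]" with the original image embedding."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(logits))

    def xent(l):
        # cross-entropy with matching pairs on the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch of stand-in image embeddings (the real encoders stay frozen;
# only the mapping network's weights would be updated during training).
batch = rng.normal(size=(4, DIM))
W1, b1 = rng.normal(scale=0.02, size=(DIM, DIM)), np.zeros(DIM)
W2, b2 = rng.normal(scale=0.02, size=(DIM, DIM)), np.zeros(DIM)
tokens = mapping_network(batch, W1, b1, W2, b2)
# Placeholder for the frozen text encoder applied to "a photo of [s]":
# identity here, purely so the loss is computable in this sketch.
text_embs = tokens
loss = contrastive_loss(batch, text_embs)
print(float(loss))
```

In the actual method, gradients from this loss flow only into the mapping network, so the pre-trained CLIP alignment between vision and language is preserved.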
We conducted a range of experiments to evaluate Pic2Word on different CIR tasks. One task is domain conversion, where the query asks for the reference image rendered in a different domain or style (e.g., as a sketch). Another is fashion attribute composition, where the query modifies an attribute of the item, such as its color or logo. Our approach outperformed baselines and achieved results comparable to supervised models.
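At retrieval time, the pseudo word token is inserted into a prompt such as "a sketch of [s]" and candidates are ranked by cosine similarity to the composed query embedding. The sketch below illustrates only that ranking step, with hypothetical stand-ins: the frozen text encoder is modeled as a simple sum of the pseudo token and a style direction, and the gallery is random except for one planted target.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 512  # hypothetical CLIP embedding size

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compose_query(pseudo_token, style_vec):
    """Hypothetical stand-in for encoding a prompt like "a sketch of [s]":
    modeled here as a normalized sum, purely for illustration."""
    return normalize(pseudo_token + style_vec)

def retrieve(query_emb, gallery_embs, k=3):
    """Rank gallery images by cosine similarity to the composed query."""
    sims = normalize(gallery_embs) @ query_emb
    return np.argsort(-sims)[:k]

pseudo_token = normalize(rng.normal(size=DIM))  # from the mapping network
sketch_style = normalize(rng.normal(size=DIM))  # "sketch" text direction
query = compose_query(pseudo_token, sketch_style)

gallery = rng.normal(size=(100, DIM))
gallery[42] = 10.0 * query  # plant a target matching the composed query
top = retrieve(query, gallery)
print(top[0])  # the planted target ranks first
```

Because image and text live in CLIP's shared embedding space, the same ranking procedure works whether the query modifies the domain, an object, or a fashion attribute.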
In conclusion, Pic2Word is an effective method for zero-shot composed image retrieval. By training on an image-caption dataset, we can build a strong CIR model without labeled triplet data. One future research direction is to utilize caption data in training the mapping network. We would like to acknowledge the researchers and team members involved in this project for their valuable contributions.