Introducing Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval
Image retrieval is a core capability of search engines. Users typically rely on either an image or text as a query to find a desired image, but accurately describing an image in words can be difficult. For example, when searching for a fashion item, a user may want a variant of an item they found on a website that differs in a specific attribute, such as the color of a logo or the logo itself. Retrieving that item with an existing search engine is challenging because describing the modification precisely in text alone is hard.
This problem can be addressed with composed image retrieval (CIR). CIR retrieves images based on a query that combines an image and a text sample providing instructions on how to modify that image to fit the intended retrieval target. By combining image and text in the query, CIR enables precise retrieval of the target image.
However, existing CIR methods require large amounts of labeled data, i.e., triplets of a query image, a text description of the desired modification, and the target image. Collecting such triplets is costly, and models trained on them often do not generalize across datasets. To overcome this challenge, we propose a task called zero-shot composed image retrieval (ZS-CIR), in which we aim to build a single CIR model that performs various CIR tasks without requiring labeled triplet data. Instead, the model is trained using large-scale image-caption pairs and unlabeled images, which are much easier to collect at scale. We also release the code for reproducibility and to support further research.
Our method, Pic2Word, leverages the language capabilities of the language encoder in the contrastive language-image pre-trained model (CLIP). Given a query image, a lightweight mapping network converts its visual embedding from the CLIP visual encoder into a pseudo word token in the text embedding space. The mapping network is optimized with a vision-language contrastive loss that encourages the text embedding of a prompt containing the pseudo token to match the image's visual embedding, keeping the visual and text embedding spaces as close as possible. This allows us to treat the query image as if it were a word, enabling flexible composition of image features and text descriptions. We train only the mapping network, using unlabeled images and keeping the CLIP encoders frozen, which makes training data-efficient.
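Below is a minimal PyTorch-style sketch of this training idea, under the assumption of a CLIP-like model exposing an `encode_image` method. The helper `encode_prompt_with_pseudo_token`, the prompt template "a photo of [*]", and the embedding dimensions are illustrative placeholders, not part of any real CLIP API or the exact implementation released with the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MappingNetwork(nn.Module):
    """Maps a frozen CLIP image embedding to a single pseudo word-token embedding."""

    def __init__(self, embed_dim=512, token_dim=512):  # dims depend on the CLIP variant
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, token_dim),
        )

    def forward(self, image_embed):
        return self.mlp(image_embed)


def pic2word_loss(clip_model, mapper, images, temperature=0.07):
    # CLIP encoders stay frozen; only the mapping network receives gradients.
    with torch.no_grad():
        img_emb = F.normalize(clip_model.encode_image(images), dim=-1)

    # Map each image to a pseudo token and splice it into the prompt
    # "a photo of [*]". `encode_prompt_with_pseudo_token` is a hypothetical
    # helper standing in for code that injects the token embedding into the
    # prompt before CLIP's text encoder.
    pseudo_tokens = mapper(img_emb)
    txt_emb = F.normalize(
        encode_prompt_with_pseudo_token(clip_model, "a photo of [*]", pseudo_tokens),
        dim=-1,
    )

    # Symmetric contrastive loss: the text embedding of each image's prompt
    # should match that image's visual embedding and differ from the others
    # in the batch.
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(len(images), device=images.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```

Because the loss only requires pairing each image with its own prompt, a plain collection of unlabeled images is sufficient for training.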
We conducted experiments to evaluate Pic2Word on a range of CIR tasks. For domain conversion, where the query image should be retrieved in a new desired domain or style (e.g., an origami or cartoon rendition of the object), our approach outperformed baseline methods. We also evaluated the composition of fashion attributes, such as the color of clothing, a logo, or sleeve length, using the Fashion-IQ dataset; there, our method performed better than supervised baselines trained with smaller backbones, highlighting its effectiveness.
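To make the retrieval step concrete, the sketch below shows how a composed query could be formed and ranked against a gallery, reusing the hypothetical `encode_prompt_with_pseudo_token` helper from the training sketch. The prompt template, the instruction strings, and the gallery setup are assumptions for illustration, not the exact evaluation protocol of the paper.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def retrieve(clip_model, mapper, query_image, instruction, gallery_embeds, k=10):
    """Return indices of the top-k gallery images for an (image + text) query."""
    # Map the query image to its pseudo word token.
    img_emb = F.normalize(clip_model.encode_image(query_image[None]), dim=-1)
    pseudo_token = mapper(img_emb)

    # e.g. instruction = "in origami style" for domain conversion, or
    # "with a red logo" for a fashion-attribute edit.
    prompt = f"a photo of [*] {instruction}"
    query_emb = F.normalize(
        encode_prompt_with_pseudo_token(clip_model, prompt, pseudo_token), dim=-1
    )

    # Rank pre-computed, L2-normalized gallery embeddings (N x D) by cosine
    # similarity to the composed query embedding.
    scores = (query_emb @ gallery_embeds.t()).squeeze(0)
    return scores.topk(k).indices
```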
In conclusion, Pic2Word is an effective method for mapping pictures to words for zero-shot composed image retrieval. By building on CLIP, which is pre-trained on image-caption data, and training only a lightweight mapping network on unlabeled images, we can build a highly effective CIR model without the need for annotated triplets. Future research could explore utilizing caption data to train the mapping network, further improving the performance of the model.
Note: This article was written by Kuniaki Saito and Kihyuk Sohn from Google Research, with contributions from other researchers.