Advancing Text-Guided Image Editing with Imagen Editor and EditBench
Text-to-image generation has experienced significant breakthroughs in recent years, driven by models such as Google's Imagen and Parti and OpenAI's DALL-E 2. One practical application of this technology is text-guided image editing (TGIE), which enables quick, precise edits to visuals without starting from scratch.
TGIE is a valuable tool for tasks like tweaking objects in vacation photos or refining details in generated images. It also presents an opportunity to improve the training of foundation models themselves: TGIE techniques can generate and recombine high-quality synthetic data, optimizing the distribution of training data along various axes.
In their work “Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting,” Su Wang and Ceslee Montgomery of Google Research introduce Imagen Editor, a state-of-the-art model for masked inpainting, the task of editing specific areas of an image based on a text prompt and an overlay mask. They also present EditBench, a systematic benchmark for evaluating the quality of image editing models.
Imagen Editor takes three inputs from the user: the image to be edited, a binary mask indicating the edit region, and a text prompt. By incorporating the user’s intent, Imagen Editor performs localized and realistic edits to the designated areas.
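The locality of such an edit can be illustrated with a toy compositing sketch: pixels inside the binary mask come from the model's generated proposal, while pixels outside it are copied from the original image. This is a minimal NumPy illustration of the input contract described above, not Imagen Editor's actual pipeline; the function name and array shapes are hypothetical.

```python
import numpy as np

def composite_edit(image, mask, edited):
    """Blend a generated edit into the original image.

    image:  (H, W, C) float array, the original photo
    mask:   (H, W) binary array, 1 inside the user-drawn edit region
    edited: (H, W, C) float array, the model's generated proposal

    Pixels outside the mask are copied from the original, so the
    edit stays localized to the designated region.
    """
    m = mask[..., None].astype(image.dtype)  # broadcast mask over channels
    return edited * m + image * (1.0 - m)

# Hypothetical 4x4 RGB example: edit only the top-left 2x2 patch.
image = np.zeros((4, 4, 3))   # original (all black)
edited = np.ones((4, 4, 3))   # model output (all white)
mask = np.zeros((4, 4))
mask[:2, :2] = 1
out = composite_edit(image, mask, edited)
```

Only the masked patch changes; the rest of `out` is identical to `image`.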
The model relies on three core techniques for high-quality text-guided image inpainting. First, it uses an object-detector masking policy: masks during training are produced from detected objects rather than drawn at random. This improves alignment between text prompts and masked regions, ensuring that the text input is effectively utilized during training.
Second, Imagen Editor enhances high-resolution editing by conditioning on full-resolution inputs through downsampling convolutions, a technique critical for achieving high fidelity and precise edits. Third, the model employs classifier-free guidance to push generated images toward stronger agreement with the input text prompt.
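Classifier-free guidance combines the diffusion model's unconditional and text-conditional noise predictions at each sampling step, extrapolating toward the conditional one. Below is a minimal sketch of the standard CFG update with toy vectors standing in for real noise predictions; the function name is hypothetical, and Imagen Editor's exact guidance schedule is detailed in the paper.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance update.

    eps_uncond: noise prediction without the text prompt
    eps_cond:   noise prediction conditioned on the text prompt
    w:          guidance weight; w > 1 pushes samples to agree
                more strongly with the prompt

    Returns the guided noise estimate used at each sampling step.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy 2-D vectors standing in for the model's noise predictions.
eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, -1.0])
guided = cfg_noise(eps_uncond, eps_cond, w=2.0)
```

With `w=2.0`, the guided estimate overshoots the conditional prediction, amplifying the direction implied by the text.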
To evaluate the performance of Imagen Editor and other models, Su Wang and Ceslee Montgomery created the EditBench dataset. It consists of 240 images, including both generated and natural images. The dataset covers a wide range of language, image types, and levels of text prompt specificity.
EditBench prompts focus on attributes, object types, and scenes, allowing researchers to test the models’ ability to handle fine-grained details. Human evaluation is considered the gold standard for assessing the models’ performance due to the limitations of existing automatic evaluation metrics.
In the human evaluation, Imagen Editor with object masking outperformed comparison models such as Stable Diffusion and DALL-E 2 across all evaluation categories. Annotators rated both text-image alignment and the realism of rendered objects and attributes, and Imagen Editor led on both.
Su Wang and Ceslee Montgomery’s research on Imagen Editor and EditBench represents a significant advancement in text-guided image editing. Their techniques improve the quality and control of image inpainting, providing a valuable tool for various applications.