Text-to-image diffusion models have reshaped generative AI by producing high-quality, realistic images from text prompts. They have found applications in image-to-image translation, controllable generation, and customization. More recently, researchers have been exploring whether these models can go beyond 2D images and tackle more complex visual tasks with the help of modality-specific training data.

The challenge is keeping a group of images consistent with one another when an image diffusion model is used to synthesize or edit them. These models currently do not take consistency across images into account, so the results can be incoherent. In panorama image editing, for example, it becomes evident that the result was stitched together from separate pictures.

To address this issue, researchers have proposed a technique called Collaborative Score Distillation (CSD). It draws on the generative prior of text-to-image diffusion models and uses Stein variational gradient descent (SVGD) to achieve consistency across samples. They also introduce CSD-Edit, a method for consistent visual editing built on the instruction-guided image diffusion model Instruct-Pix2Pix.
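To give a feel for the SVGD component, here is a minimal NumPy sketch of a generic SVGD update on a handful of 2D points. It is not the researchers' CSD implementation: in CSD the score would come from the diffusion model, whereas this sketch assumes a toy standard Gaussian target whose score is simply `-x`. The names `rbf_kernel`, `svgd_step`, `score_fn`, `bandwidth`, and `step_size` are illustrative choices, not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, bandwidth):
    """RBF kernel matrix and its gradient for particles x of shape (n, d)."""
    diff = x[:, None, :] - x[None, :, :]          # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)          # squared pairwise distances
    k = np.exp(-sq_dist / bandwidth)              # k[j, i] = k(x_j, x_i)
    # Gradient of k(x_j, x_i) with respect to x_j.
    grad_k = (-2.0 / bandwidth) * diff * k[:, :, None]
    return k, grad_k

def svgd_step(x, score_fn, step_size=0.1, bandwidth=1.0):
    """One SVGD update: every particle moves using information from all others."""
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, bandwidth)
    scores = score_fn(x)                          # (n, d): grad of log p at each particle
    # Kernel-weighted average of scores pulls particles toward high density;
    # the summed kernel gradient pushes nearby particles apart.
    phi = (k.T @ scores + grad_k.sum(axis=0)) / n
    return x + step_size * phi

# Toy usage (assumption: standard Gaussian target, so the score is -x).
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
for _ in range(500):
    particles = svgd_step(particles, lambda p: -p)
print(particles.mean(axis=0), particles.std(axis=0))
```

The point to notice is that each particle's update mixes in the scores of every other particle through the kernel. That coupling is what lets a set of samples be updated jointly rather than one at a time.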

The researchers demonstrate the approach on several applications, including panorama image editing, video editing, and 3D scene reconstruction. They show that CSD can edit panoramic images with spatial consistency while balancing instruction accuracy against image consistency. In the video editing experiments, CSD-Edit maintains temporal consistency, so edits stay coherent from frame to frame. CSD-Edit also enables the generation and editing of 3D scenes, keeping the result uniform across different viewpoints.

