Diffusion models have become extremely popular for text-to-image generation. They have improved image quality and inference performance, and expanded our creative possibilities. However, controlling generation effectively is still a challenge, especially for conditions that are difficult to describe in words.
To address this, Google researchers have developed MediaPipe diffusion plugins, which let users control on-device text-to-image generation. The work builds on their earlier GPU-inference work for large on-device generative models and provides low-cost solutions for controllable text-to-image generation. These solutions can be integrated into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.
Diffusion models produce images through iterative denoising: starting from an image of pure noise, each iteration removes a little of that noise, gradually revealing the target image. Text prompts have greatly enhanced image generation by bringing language understanding into the process; the text embedding is connected to the model through cross-attention layers. However, some details, such as object position and pose, remain difficult to convey with text prompts alone.
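The iterative denoising loop can be sketched with a toy example. In a real diffusion model, the denoiser is a large U-Net conditioned on the text embedding; here a simple blending update stands in for one reverse-diffusion step, purely to illustrate the loop structure (the step rule and image size are illustrative, not the real model's).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "target" that the toy denoiser pulls the sample toward.
target = rng.uniform(0.0, 1.0, size=(8, 8))

def denoise_step(x, step, num_steps):
    # Toy update: blend the current noisy image toward the target.
    # A real reverse-diffusion step would instead subtract the noise
    # predicted by the U-Net at this timestep.
    alpha = 1.0 / (num_steps - step)
    return (1.0 - alpha) * x + alpha * target

num_steps = 50
x = rng.normal(size=(8, 8))          # start from pure Gaussian noise
for step in range(num_steps):
    x = denoise_step(x, step, num_steps)

# After all steps, x has converged to the target image.
print(np.abs(x - target).max())
```

The key point the sketch captures is that generation is not a single forward pass: the same network is applied repeatedly, which is why per-step efficiency matters so much on mobile devices.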
To address this challenge, researchers have introduced control information from a condition image into the diffusion process using additional models. There are three frequently used methods for generating controlled text-to-image output: Plug-and-Play, ControlNet, and T2I Adapter.
Plug-and-Play uses a copy of the diffusion model and denoising diffusion implicit model (DDIM) inversion to encode the state of an input image. ControlNet trains a duplicate of the diffusion model's encoder to encode conditioning information. T2I Adapter, by contrast, is a smaller network that takes only the conditioning image as input and delivers comparable results in controlled generation.
The MediaPipe diffusion plugin is a standalone network that can be connected to a trained baseline model. It is portable and can be run independently on mobile devices at minimal additional cost. The plugin adds conditioned features to the encoder of a diffusion model, enabling conditioned image production.
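A minimal sketch of that mechanism, assuming (for illustration only) three encoder levels with hypothetical channel counts: the plugin maps the condition image to one feature map per level, and those features are added to the frozen base model's encoder activations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy encoder activations at three downsampling levels of the diffusion
# model's encoder (shapes are illustrative, not the real model's).
encoder_feats = [rng.normal(size=(32, 32, 64)),
                 rng.normal(size=(16, 16, 128)),
                 rng.normal(size=(8, 8, 256))]

def plugin(condition_image):
    # Stand-in for the lightweight plugin network: it maps the condition
    # image (e.g. an edge or depth map) to one feature map per level.
    return [rng.normal(size=f.shape) for f in encoder_feats]

condition = rng.uniform(size=(256, 256, 3))

# Plugin features are added to the base model's encoder activations,
# steering generation without retraining the diffusion model itself.
conditioned = [f + p for f, p in zip(encoder_feats, plugin(condition))]
```

Because the plugin only produces additive features, the base model's weights stay untouched, which is what makes the plugin portable across a base model and its LoRA variants.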
The plugin network is lightweight, with only 6M parameters. It is built on MobileNetV2, which employs depth-wise convolutions and inverted bottlenecks for fast inference on mobile devices.
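The savings from depth-wise convolutions can be seen with a simple weight count. The sketch below compares a standard 3x3 convolution against a depth-wise separable one (a depth-wise 3x3 followed by a 1x1 pointwise convolution, the building block MobileNetV2's inverted bottlenecks expand on); the channel sizes are illustrative, and biases and batch-norm parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    # A standard conv mixes all input channels for every output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    # Depth-wise conv: one k x k filter per input channel,
    # followed by a 1x1 pointwise conv to mix channels.
    return k * k * c_in + c_in * c_out

print(standard_conv_params(256, 256))            # → 589824
print(depthwise_separable_params(256, 256))      # → 67840
```

At these (illustrative) channel counts the factorized form needs roughly 9x fewer weights, which is how a useful conditioning network can fit in about 6M parameters.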
To summarize, the MediaPipe diffusion plugin is a convenient and scalable solution for conditioned image generation. It integrates easily into existing models, runs efficiently on mobile devices, and offers low-cost, programmable text-to-image creation.