MediaPipe Diffusion Plugins: Empowering On-Device Text-to-Image Generation
In the field of artificial intelligence, diffusion models have achieved remarkable success in text-to-image generation. These models have improved image quality, enhanced inference performance, and expanded our creative possibilities. However, effectively controlling the generation process remains a challenge, especially when dealing with conditions that are not easily described using text alone.
Today, we are excited to announce the launch of MediaPipe diffusion plugins, which allow for controllable text-to-image generation on mobile devices. Building upon our previous work on GPU inference for on-device generative models, we have developed low-cost solutions for controllable text-to-image generation that can be seamlessly integrated into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.
Background: How Diffusion Models Work
Diffusion models approach image generation as an iterative denoising process. Starting from an image of pure noise, the model gradually removes noise to reveal the desired image. Conditioning the model on text prompts has proven to be an effective way to steer generation: in text-to-image models, the text embedding is connected to the model through cross-attention layers. However, certain information, such as the position and pose of an object, is difficult to convey through text prompts alone.
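The iterative denoising loop can be sketched in a few lines. This is a toy illustration only, not a real diffusion model: the "denoiser" here is a hand-written stand-in (in a real model, a trained network predicts the noise to remove, conditioned on the text embedding), and the "image" is a short vector standing in for pixels.

```python
import random

def toy_denoise_step(x, step, total_steps, target):
    """Stand-in for a learned denoiser: nudge each value toward the
    (normally unknown) target by a share that grows as steps proceed.
    A real model instead predicts and subtracts noise at each step."""
    alpha = 1.0 / (total_steps - step)  # close a growing fraction of the gap
    return [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]

def generate(target, total_steps=50, seed=0):
    """Iterative denoising: start from pure noise, refine step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure-noise starting point
    for step in range(total_steps):
        x = toy_denoise_step(x, step, total_steps, target)
    return x

result = generate([0.2, 0.8, 0.5])  # converges to the target "image"
```

The key property the sketch preserves is that generation is a sequence of small refinements rather than a single forward pass, which is why the cost of any extra per-step network matters so much on mobile hardware.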
To address this challenge, researchers have introduced additional models into the diffusion process to incorporate control information from a condition image. Common approaches to controlled text-to-image generation include Plug-and-Play, ControlNet, and T2I Adapter. However, these models have limitations in terms of size, efficiency, and portability.
Introducing the MediaPipe Diffusion Plugins
To make controlled text-to-image generation efficient, customizable, and scalable, we have developed the MediaPipe diffusion plugin. This plugin is a separate network that can easily be connected to pre-trained diffusion models. It is trained from scratch and is designed to run on mobile devices with minimal computational cost.
Key Features of the MediaPipe Diffusion Plugin
The MediaPipe diffusion plugin stands out for its lightweight design, with only 6M parameters. It uses depth-wise convolutions and inverted bottlenecks, as in MobileNetV2, for fast inference on mobile devices. The plugin extracts multiscale features from a condition image and adds them to the encoder of the diffusion model, providing an additional conditioning signal that guides image generation in text-to-image models.
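Depth-wise separable convolutions are the main reason such a design stays small. A quick parameter count shows the saving; the layer sizes below are illustrative examples, not the plugin's actual architecture:

```python
def standard_conv_params(c_in, c_out, k):
    """Parameter count of a standard 2-D convolution (bias ignored):
    one k×k filter per (input channel, output channel) pair."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise convolution (one k×k filter per input channel) followed
    by a 1×1 pointwise convolution, as in MobileNetV2-style blocks."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 128 -> 128 channels, 3×3 kernel.
std = standard_conv_params(128, 128, 3)        # 147,456 parameters
sep = depthwise_separable_params(128, 128, 3)  # 17,536 parameters
ratio = std / sep                              # ~8.4× fewer parameters
```

Stacking blocks like this is how a conditioning network can stay in the single-digit-millions parameter range while still capturing multiscale features from the condition image.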
Efficient and Portable Image Generation
Unlike ControlNet, which requires running at every diffusion step, the MediaPipe plugin only needs to run once for each image generation. This saves computational resources and improves inference efficiency. Furthermore, the MediaPipe diffusion plugin has been tested on various mobile devices, showing consistent performance across different platforms.
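The run-once design dominates the cost comparison. A minimal cost model makes this concrete; the unit cost and step count below are illustrative assumptions, not measured numbers (in practice the gap is even larger, since a ControlNet is also a much bigger network than the 6M-parameter plugin):

```python
def conditioning_cost(per_run_cost, diffusion_steps, runs_every_step):
    """Total cost contributed by the conditioning network for one image.
    ControlNet-style: re-runs at every denoising step.
    Plugin-style: runs once; its features are reused across all steps."""
    return per_run_cost * (diffusion_steps if runs_every_step else 1)

# Illustrative numbers: unit cost per run, 20 denoising steps.
controlnet_total = conditioning_cost(1.0, 20, runs_every_step=True)   # 20.0
plugin_total     = conditioning_cost(1.0, 20, runs_every_step=False)  # 1.0
```

Because the plugin's cost is constant in the number of denoising steps, using more steps for higher quality does not make the conditioning any more expensive.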
Evaluation and Results
We conducted a quantitative study on the face landmark plugin to evaluate its performance. The results showed that both ControlNet and the MediaPipe diffusion plugin achieved significantly better sample quality than the base model. Additionally, the MediaPipe plugin demonstrated superior inference efficiency, making it a preferred choice for on-device text-to-image generation.
Conclusion: Empowering Generative AI Applications
With the introduction of MediaPipe diffusion plugins, we enable more flexible applications of generative AI. By running text-to-image generation and plugins directly on mobile devices, we enhance the control and customization of image generation. Our portable plugins can be seamlessly integrated into pre-trained diffusion models, allowing for efficient and scalable text-to-image generation.
MediaPipe diffusion plugins bring us closer to controllable text-to-image generation on mobile devices. We are excited about the possibilities this technology opens up and look forward to further advances in generative AI.