On-Device Text-to-Image Generation with MediaPipe Diffusion Plugins

Diffusion models have become hugely popular for text-to-image generation. They have improved image quality and inference performance and expanded creative possibilities. However, effectively controlling the generation process remains a challenge, especially for conditions that are difficult to describe in words.

To address this, Google researchers have developed MediaPipe diffusion plugins, which let users control on-device text-to-image generation. The work builds on Google's earlier GPU inference for large generative models on device and provides low-cost solutions for controllable text-to-image generation. These solutions can be plugged into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.

In diffusion models, images are produced by iterative denoising: generation starts from an image of pure noise and gradually removes noise over many steps until the target image emerges. Text prompts have greatly improved image generation by bringing language understanding into the process; the text embedding is connected to the model through cross-attention layers. However, some details, such as object position and pose, are difficult to convey with text prompts alone.
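To make the loop concrete, here is a minimal, self-contained PyTorch sketch of iterative denoising with a text prompt injected through cross-attention. The ToyDenoiser, the 50-step schedule, and the update rule are illustrative placeholders chosen for this example, not the actual architecture or sampler used by production diffusion models.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Toy cross-attention: image tokens attend to text-prompt embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, image_tokens, text_tokens):
        out, _ = self.attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return image_tokens + out  # residual connection

class ToyDenoiser(nn.Module):
    """Stand-in for a diffusion U-Net: predicts the noise in the current image."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        self.cross_attn = CrossAttention(dim)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, text_tokens):
        h = self.proj_in(noisy_tokens)
        h = self.cross_attn(h, text_tokens)  # inject the prompt
        return self.proj_out(h)              # predicted noise

# Iterative denoising: start from pure noise and remove a bit of noise per step.
model = ToyDenoiser()
text_tokens = torch.randn(1, 8, 64)  # placeholder prompt embedding
x = torch.randn(1, 16, 64)           # the "image" starts as Gaussian noise

for t in range(50, 0, -1):
    pred_noise = model(x, text_tokens)
    x = x - pred_noise / t           # toy update rule, not a real sampler
```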

To address this challenge, researchers have introduced control information from a condition image into the diffusion process via additional models. Three frequently used methods for controlled text-to-image generation are Plug-and-Play, ControlNet, and T2I Adapter.

Plug-and-Play uses a copy of the diffusion model together with denoising diffusion implicit model (DDIM) inversion to encode the state of an input image. ControlNet trains a duplicate of the diffusion model's encoder to encode conditioning information. T2I Adapter, by contrast, is a smaller network that takes only the condition image as input and delivers comparable results in controlled generation.
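As a rough illustration of the ControlNet idea, the sketch below pairs a frozen placeholder encoder with a trainable copy whose output is merged back through a zero-initialized convolution, so the copy contributes nothing before training. The module shapes and names are assumptions made for this example, not the real diffusion U-Net.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero, so the trainable branch starts out
    contributing nothing (the standard ControlNet trick)."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Placeholder "encoder" standing in for the diffusion U-Net encoder.
frozen_encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
for p in frozen_encoder.parameters():
    p.requires_grad = False

# ControlNet-style setup: a trainable copy of the encoder processes the
# condition image, and its output is merged back through the zero conv.
trainable_copy = copy.deepcopy(frozen_encoder)
for p in trainable_copy.parameters():
    p.requires_grad = True
merge = zero_conv(32)

noisy_latent = torch.randn(1, 3, 64, 64)
condition = torch.randn(1, 3, 64, 64)  # e.g. an edge or depth map
features = frozen_encoder(noisy_latent) + merge(trainable_copy(condition))
```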

The MediaPipe diffusion plugin is a standalone network that can be plugged into a trained base model. It is portable and runs on mobile devices at minimal additional cost. The plugin feeds conditioning features into the encoder of the diffusion model, enabling conditioned image generation.
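The sketch below illustrates this injection pattern with a hypothetical ConditionPlugin that maps the condition image to feature maps at several resolutions, which are then added to the matching encoder activations. The layer sizes and names are invented for illustration and do not reproduce the actual MediaPipe plugin.

```python
import torch
import torch.nn as nn

class ConditionPlugin(nn.Module):
    """Toy stand-in for the plugin: maps a condition image (e.g. edges, depth)
    to feature maps at the resolutions used by the diffusion encoder."""
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.ReLU(),
            ))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, condition_image):
        feats, h = [], condition_image
        for stage in self.stages:
            h = stage(h)
            feats.append(h)  # one feature map per encoder resolution
        return feats

# The diffusion model's encoder stays frozen; plugin features are simply
# added to the encoder activations at matching resolutions.
plugin = ConditionPlugin()
condition = torch.randn(1, 3, 256, 256)  # e.g. an edge map
plugin_feats = plugin(condition)

encoder_feats = [torch.randn_like(f) for f in plugin_feats]  # placeholder U-Net activations
conditioned = [e + p for e, p in zip(encoder_feats, plugin_feats)]
```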

The plugin network is lightweight, with only about 6M parameters. It is built in the style of MobileNetV2, using depthwise convolutions and inverted bottlenecks for fast inference on mobile devices.
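For reference, a MobileNetV2-style inverted-bottleneck block looks roughly like this: a 1x1 expansion, a cheap depthwise 3x3 convolution, and a 1x1 projection back down, with a residual connection when input and output widths match. The channel widths below are arbitrary and only meant to show the structure, not the plugin's actual configuration.

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """MobileNetV2-style block: expand with a 1x1 conv, filter with a cheap
    depthwise 3x3 conv, then project back down with another 1x1 conv."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),         # depthwise filter
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),   # project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual, since channel counts match

x = torch.randn(1, 32, 64, 64)
print(InvertedBottleneck(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```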

To summarize, the MediaPipe diffusion plugin is a convenient, scalable solution for conditioned image generation. It integrates easily into existing models, runs on mobile devices, and offers low-cost, controllable text-to-image generation.
