Senior Software Engineer Yang Zhao and Senior Staff Software Engineer Tingbo Hou have developed a new way to quickly generate high-quality images from text on mobile devices. This matters because most current text-to-image models are too large and slow to run on a phone. In their article “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices”, they introduce a compact, efficient model called MobileDiffusion that can generate an image in about half a second with only 520M parameters.
The challenges of existing models stem from their complex architectures and large parameter counts, which make sampling slow. While previous research has focused on reducing the number of denoising steps needed to generate an image, little attention has been paid to redesigning the architecture itself for efficiency. In their study, Zhao and Hou carefully examined and redesigned the UNet architecture used in existing models to create MobileDiffusion, which consists of three parts: a text encoder, a diffusion UNet, and an image decoder.
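To make that three-component layout concrete, here is a minimal PyTorch sketch, assuming a standard latent-diffusion arrangement: the text encoder turns the prompt into embeddings, the UNet denoises a random latent conditioned on those embeddings, and a VAE-style decoder maps the latent to pixels. The tiny modules and all dimensions below (77 tokens, 768-dim embeddings, a 4×64×64 latent) are illustrative stand-ins, not MobileDiffusion’s actual layers.

```python
import torch
import torch.nn as nn

class TinyTextEncoder(nn.Module):
    """Stand-in text encoder: prompt token ids -> per-token embeddings."""
    def __init__(self, vocab=49408, dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
    def forward(self, token_ids):          # (B, T) -> (B, T, dim)
        return self.embed(token_ids)

class TinyUNet(nn.Module):
    """Stand-in denoiser: predicts a cleaner latent given text conditioning.
    Timestep conditioning is omitted in this sketch."""
    def __init__(self, channels=4, dim=768):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.cond_proj = nn.Linear(dim, channels)
    def forward(self, latent, t, cond):
        c = self.cond_proj(cond.mean(dim=1))[:, :, None, None]
        return self.conv(latent) + c

class TinyDecoder(nn.Module):
    """Stand-in VAE decoder: latent -> RGB image."""
    def __init__(self, channels=4):
        super().__init__()
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)
    def forward(self, latent):
        return self.to_rgb(latent)

@torch.no_grad()
def generate(token_ids, text_encoder, unet, decoder, num_steps=1):
    cond = text_encoder(token_ids)                       # prompt -> embeddings
    latent = torch.randn(token_ids.shape[0], 4, 64, 64)  # start from pure noise
    for t in reversed(range(num_steps)):                 # num_steps=1 mimics one-step sampling
        timestep = torch.full((latent.shape[0],), t)
        latent = unet(latent, timestep, cond)
    return decoder(latent)                               # decode latent to pixels

image = generate(torch.randint(0, 49408, (1, 77)),
                 TinyTextEncoder(), TinyUNet(), TinyDecoder())
print(image.shape)  # torch.Size([1, 3, 64, 64])
```

The sketch keeps the same division of labor the authors describe: conditioning, denoising, and decoding are separate modules, so each can be optimized on its own.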
For the diffusion UNet, they found that increasing the number of transformer blocks at the bottleneck, where the spatial resolution is lowest, and using lightweight convolution layers improved efficiency. They also optimized the image decoder by training a variational autoencoder (VAE) and pruning its architecture. Additionally, they adopted one-step sampling based on DiffusionGAN, which replaces the usual multi-step denoising loop and significantly reduces the time needed to generate an image.
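As a rough illustration of the two UNet ideas above, the sketch below pairs cheap depthwise-separable convolutions at high resolution with a stack of transformer blocks at the low-resolution bottleneck, where attention over only 8×8 = 64 tokens is inexpensive. Channel widths and block counts here are assumptions for illustration, not the published MobileDiffusion configuration.

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depthwise + pointwise convolution: far fewer multiply-adds than a full conv."""
    def __init__(self, ch):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, 1)
    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class BottleneckTransformer(nn.Module):
    """Self-attention over the small bottleneck grid, where attention is cheap."""
    def __init__(self, ch, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)
    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        n = self.norm(tokens)
        tokens = tokens + self.attn(n, n, n, need_weights=False)[0]
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# High-resolution stages use cheap separable convs; the bottleneck stacks
# several transformer blocks, since attention cost grows quickly with resolution.
hi_res_stage = nn.Sequential(SeparableConv(320), SeparableConv(320))
bottleneck   = nn.Sequential(*[BottleneckTransformer(1280) for _ in range(4)])

x_hi = torch.randn(1, 320, 64, 64)
x_bn = torch.randn(1, 1280, 8, 8)
print(hi_res_stage(x_hi).shape, bottleneck(x_bn).shape)
```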
Finally, Zhao and Hou tested MobileDiffusion and found that it can generate high-quality images in about half a second on both iOS and Android devices. This is a major advancement for AI on mobile devices and has the potential to greatly enhance the user experience. With rising privacy concerns, efficient on-device AI is becoming increasingly important, and MobileDiffusion is paving the way for this new frontier.