PIXART-α: High-Quality and Affordable Text-to-Image Generation
A new era of photorealistic image synthesis has begun with the development of text-to-image (T2I) generative models such as DALL-E 2, Imagen, and Stable Diffusion. These models have had a significant impact on image editing, video production, and 3D asset creation. However, training such advanced models demands enormous computational resources, at considerable financial cost. For example, training RAPHAEL, a recent state-of-the-art model, costs around $3,080,000 and produces 35 tonnes of CO2 emissions, placing a heavy burden on the environment.
To address these limitations, researchers from Huawei Noah’s Ark Lab, Dalian University of Technology, HKU, and HKUST have introduced PIXART-α, a model that dramatically reduces training compute while maintaining image quality competitive with state-of-the-art generators. Three core designs make this possible:
1. Training strategy decomposition: The T2I generation task is broken into three subtasks – learning the pixel distribution of natural images, learning text-image alignment, and improving aesthetic quality. The cost of the first subtask is cut by initializing from a low-cost class-condition model, after which the model is pretrained and fine-tuned for the other two subtasks.
2. An efficient T2I transformer: Building on the Diffusion Transformer (DiT), they inject text conditions through cross-attention modules and streamline the computationally intensive class-condition branch. They also introduce a reparameterization technique that lets the modified text-to-image model load the class-condition model’s ImageNet-learned parameters for better initialization and faster training (see the first code sketch after this list).
3. High-informative data: Existing text-image pair datasets suffer from sparse, noisy captions, so the researchers propose an auto-labeling pipeline that uses advanced vision-language models to generate dense captions. The SAM dataset, with its diverse collection of objects, yields captions with high information density, improving text-image alignment learning (see the second sketch after this list).
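To make the cross-attention design in item 2 concrete, here is a minimal PyTorch sketch of a DiT-style block in which image latent tokens attend to text-encoder states. This is an illustrative reconstruction, not the authors’ implementation: the dimensions (DiT-XL’s hidden size of 1152, a T5-XXL text-embedding size of 4096) and module names are assumptions, and the timestep-conditioned modulation that the reparameterization trick targets is omitted for brevity.

```python
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """One transformer block where text conditioning enters via cross-attention.

    Illustrative sketch only; sizes and names are assumptions, not the paper's code.
    """
    def __init__(self, dim=1152, heads=16, text_dim=4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Cross-attention: queries come from the image latents,
        # keys/values from the text-encoder hidden states.
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, text_states):
        # x: (batch, latent_tokens, dim); text_states: (batch, text_tokens, text_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Inject the text condition: latents attend to caption embeddings.
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_states, text_states, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

block = TextConditionedDiTBlock()
latents = torch.randn(2, 256, 1152)    # e.g. a 16x16 grid of latent patches
captions = torch.randn(2, 120, 4096)   # stand-in for T5 caption embeddings
out = block(latents, captions)         # -> (2, 256, 1152)
```

Because only the cross-attention modules are new, the rest of the block can inherit ImageNet-pretrained DiT weights, which is what makes the initialization trick in item 2 pay off.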
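The auto-labeling pipeline in item 3 can likewise be sketched in a few lines. The paper pairs a strong vision-language captioner with SAM’s object-rich images; in this hedged example, an off-the-shelf BLIP captioner from Hugging Face transformers stands in for the paper’s model, and the directory and output paths are placeholders.

```python
import json
from pathlib import Path
from transformers import pipeline

# Off-the-shelf image captioner; the paper's pipeline uses a stronger
# vision-language model, so treat BLIP here as a stand-in.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

records = []
for image_path in sorted(Path("sam_images").glob("*.jpg")):  # placeholder path
    caption = captioner(str(image_path))[0]["generated_text"]
    records.append({"image": image_path.name, "text": caption})

# The resulting pseudo-captioned pairs feed the text-image alignment stage.
Path("sam_autolabeled.jsonl").write_text(
    "\n".join(json.dumps(r) for r in records))
```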
Together, these designs make PIXART-α highly efficient, requiring just 675 A100 GPU days and about $26,000 for training. Compared to Imagen and RAPHAEL, PIXART-α uses significantly less training data and compute time, saving millions of dollars.
In terms of generation quality, PIXART-α outperforms current state-of-the-art models such as Stable Diffusion: in user studies it achieved better image quality and semantic alignment, and it also demonstrates an advantage in fine-grained semantic control.
The development of PIXART-α provides valuable insights for the AI community and makes high-quality T2I models more accessible and affordable for independent researchers and smaller companies.
To learn more about PIXART-α, check out the paper and project page.