DiffFit: Fine-tuning Large Diffusion Models for Image Generation
In the field of machine learning, modeling complex probability distributions is a significant challenge. Diffusion probabilistic models (DPMs) are designed to learn the inverse of a well-defined stochastic process that progressively destroys information.
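To make this concrete, the standard DDPM formulation (Ho et al., 2020) pairs a fixed forward process that gradually adds Gaussian noise with a learned reverse process that removes it. The equations below are the textbook form, included here for reference rather than taken from the article itself:

```latex
% Forward (noising) process: a fixed Markov chain that progressively destroys x_0
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)

% Reverse (denoising) process: learned by a model with parameters \theta
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```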
Denoising diffusion probabilistic models (DDPMs) have proven their value in areas such as image synthesis, video production, and 3D editing. However, state-of-the-art DDPMs are computationally expensive: they carry large parameter counts and require many denoising steps per generated image. Few practitioners can afford to train such models from scratch, so it is important to explore strategies for adapting publicly available, pre-trained diffusion models to individual applications.
Researchers at Huawei Noah’s Ark Lab have conducted a new study using the Diffusion Transformer (DiT) as a foundation. They have developed DiffFit, a simple and effective parameter-efficient fine-tuning technique for large diffusion models. The technique is inspired by recent work in natural language processing (NLP), notably BitFit, which showed that tuning only a model’s bias terms can adapt it to downstream tasks. The researchers carried this strategy over to image generation, applying BitFit-style bias tuning and adding learnable scaling factors to specific layers of the model. Placing these scaling factors at strategic points throughout the network proves crucial for improving the Fréchet Inception Distance (FID), which measures the quality of generated images.
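A rough sketch of how such a scheme can be set up in PyTorch appears below. The `ScaledBlock` wrapper and the `mark_difffit_trainable` helper are illustrative names invented for this sketch, and the exact layers DiffFit scales inside DiT differ in detail; this only conveys the general idea of freezing the backbone while training biases, normalization layers, and newly added scale factors initialized to 1:

```python
# Minimal sketch of DiffFit-style parameter selection, assuming a generic
# transformer backbone. Names and placement are illustrative, not the
# authors' exact implementation.
import torch
import torch.nn as nn

class ScaledBlock(nn.Module):
    """Wraps a frozen sub-layer with a learnable scale factor gamma."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        # Initialized to 1 so fine-tuning starts from the pretrained behavior.
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.gamma * self.sublayer(x)

def mark_difffit_trainable(model: nn.Module) -> None:
    """Freeze everything, then unfreeze only biases, normalization
    parameters, and the inserted scale factors."""
    for name, param in model.named_parameters():
        param.requires_grad = (
            name.endswith(".bias")        # BitFit-style bias tuning
            or "norm" in name.lower()     # LayerNorm weights and biases
            or "gamma" in name            # our inserted scale factors
        )
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {n_train}/{n_total} ({100 * n_train / n_total:.3f}%)")
```

Because only this small subset of parameters receives gradients, the optimizer state and checkpoint deltas stay tiny compared with full fine-tuning, which is the source of the efficiency gains the article describes.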
The team compared DiffFit with other parameter-efficient fine-tuning strategies, including BitFit, AdaptFormer, LoRA, and VPT, on eight downstream datasets. The results show that DiffFit achieves a better trade-off between the number of trainable parameters and FID than these techniques. The researchers also found that DiffFit can be used to cheaply adapt a low-resolution diffusion model to high-resolution image generation by simply treating the high-resolution images as a distinct target domain.
In fact, DiffFit surpassed prior state-of-the-art diffusion models on ImageNet 512×512 after just 25 epochs of fine-tuning from a pretrained ImageNet 256×256 checkpoint. It achieved a better FID than the original DiT-XL/2-512 model, which trains far more parameters, while requiring 30% less training time.
Overall, DiffFit provides insight into the efficient fine-tuning of large diffusion models for image generation and offers a simple yet powerful baseline for parameter-efficient fine-tuning that can be customized for individual applications.
Check out the paper for more details on DiffFit.