Researchers have made significant progress in training diffusion models with reinforcement learning (RL) to improve prompt-image alignment and to optimize objectives that are hard to specify directly. They have introduced denoising diffusion policy optimization (DDPO), which frames the denoising process as a multi-step decision-making problem, allowing Stable Diffusion to be fine-tuned on challenging downstream objectives.
DDPO is a class of policy gradient algorithms designed for this purpose: by training diffusion models directly on RL objectives, the researchers report notable improvements both in prompt-image alignment and in optimizing objectives that are difficult to express through traditional loss functions. To improve prompt-image alignment, the team incorporates feedback from LLaVA, a large vision-language model, as the reward signal. Through RL training, the generated images align markedly better with their prompts. Interestingly, the fine-tuned models drift toward a more cartoon-like style, which may reflect the prevalence of such representations in the pretraining data.
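At its core, treating denoising as a multi-step decision problem means each denoising step is an action whose log-probability can be weighted by the reward of the final image, REINFORCE-style. The sketch below illustrates this idea under simplifying assumptions: a toy Gaussian denoiser stands in for a real diffusion U-Net, and `reward_fn` is a placeholder for rewards such as compressibility or LLaVA-based alignment. It is an illustrative sketch, not the authors' implementation.

```python
# Minimal sketch of a DDPO-style policy-gradient update.
# Assumptions: ToyDenoiser stands in for a diffusion U-Net, and reward_fn
# is a placeholder scalar reward (compressibility, aesthetics, alignment, ...).
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Predicts the mean of the next (less noisy) latent from the current one."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x, t):
        t_embed = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, t_embed], dim=-1))

def reward_fn(x0):
    # Placeholder reward: prefer latents with small norm.
    return -x0.pow(2).sum(dim=-1)

def ddpo_update(denoiser, optimizer, batch_size=8, dim=16, num_steps=10, sigma=0.5):
    x = torch.randn(batch_size, dim)               # start from pure noise
    log_probs = []
    for t in reversed(range(num_steps)):
        mean = denoiser(x, t)
        dist = torch.distributions.Normal(mean, sigma)
        x = dist.sample()                          # stochastic denoising step = "action"
        log_probs.append(dist.log_prob(x).sum(dim=-1))
    rewards = reward_fn(x)                         # reward on the final sample
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # REINFORCE-style objective: every denoising step shares the final reward.
    loss = -(advantages.detach() * torch.stack(log_probs, dim=0).sum(dim=0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

denoiser = ToyDenoiser()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
for step in range(5):
    print(ddpo_update(denoiser, opt))
```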
The results achieved with DDPO across several reward functions are promising. Evaluations on compressibility, incompressibility, and aesthetic quality show significant improvements over the base model. The researchers also highlight the generalization of the RL-trained models, which extends to unseen animals, everyday objects, and novel combinations of activities and objects. However, they acknowledge the risk of over-optimization when fine-tuning against learned reward functions: the model can exploit the reward in ways that are not useful, destroying meaningful image content in the process.
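To give a sense of how simple such reward functions can be, the snippet below sketches one plausible compressibility reward: the negative JPEG-encoded file size, so images that compress well score higher (negating it gives an incompressibility reward). The exact reward used in the paper may differ; this is a hedged stand-in for illustration only.

```python
# Sketch of a JPEG-based compressibility reward (assumed form, not the
# paper's exact implementation): reward = negative encoded size in KB.
import io
import numpy as np
from PIL import Image

def jpeg_compressibility_reward(image: Image.Image, quality: int = 95) -> float:
    """Return the negative size in kilobytes of the JPEG-encoded image."""
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return -len(buffer.getvalue()) / 1024.0

# Example usage on a random RGB image.
img = Image.fromarray(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))
print(jpeg_compressibility_reward(img))
```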
In addition, the researchers observe that the LLaVA model is susceptible to typographic attacks. In prompt-alignment scenarios, RL-trained models can generate text in the image that loosely resembles the prompt and fools LLaVA, for example into reporting an incorrect number of animals.
To summarize, DDPO and RL training for diffusion models represent a significant advance in improving prompt-image alignment and optimizing diverse objectives, with clear gains in compressibility, incompressibility, and aesthetic quality. However, challenges such as reward over-optimization and vulnerabilities in prompt-based alignment methods require further investigation. These findings open up new research and development opportunities for diffusion models, particularly in image generation and completion tasks.
Check out the paper, project, and GitHub link.