Text-to-Image Generation: Introducing ProFusion
The field of text-to-image generation has seen significant progress in recent years. Researchers have developed large-scale models, like DALL-E and CogView, that can generate high-resolution images based on textual descriptions with exceptional fidelity. These models have not only revolutionized text-to-image generation but have also had a profound impact on image manipulation and video generation.
However, these large-scale models face challenges when it comes to creating novel and unique concepts specified by users. To address this, researchers have explored methods to customize pre-trained text-to-image generation models. One approach involves fine-tuning the models using a limited number of samples and employing regularization techniques to prevent overfitting. Another approach is to encode the user’s novel concept into a word embedding, obtained through optimization or an encoder network. These methods enable customized generation while meeting additional requirements specified by the user.
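To make the embedding-based approach and the role of regularization more concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption made for illustration: the frozen linear map `W` stands in for a pre-trained generator, `target` stands in for features of the user's concept, and `fit_concept_embedding` and the strength `lam` are hypothetical names. It is not the training procedure of any of the methods mentioned above.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def fit_concept_embedding(W, target, e_init, lam=0.0, steps=200):
    """Gradient descent on ||W e - target||^2 + lam * ||e - e_init||^2.

    W is kept frozen; only the embedding e is optimized.
    lam is a hypothetical regularization strength that keeps e near e_init.
    """
    # Conservative step size: the squared Frobenius norm bounds the curvature.
    frob_sq = sum(w * w for row in W for w in row)
    lr = 1.0 / (2.0 * (frob_sq + lam))

    e = list(e_init)
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(W, e), target)]
        grad = [
            2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            + 2.0 * lam * (e[j] - e_init[j])
            for j in range(len(e))
        ]
        e = [ej - lr * gj for ej, gj in zip(e, grad)]
    return e


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy frozen "generator"
target = [1.0, 2.0, 3.0]                   # toy concept features
free = fit_concept_embedding(W, target, [0.0, 0.0], lam=0.0)
reg = fit_concept_embedding(W, target, [0.0, 0.0], lam=5.0)
```

In this toy setting the unregularized embedding reproduces the target exactly, while the regularized one stays closer to its starting point and fits the target less precisely.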
Despite these advancements, there have been concerns about the limitations of customization using regularization methods. It is suspected that these techniques may unintentionally restrict the generation of fine-grained details.
To overcome this limitation, a framework called ProFusion has been proposed. ProFusion consists of a pre-trained encoder called PromptNet and a sampling method called Fusion Sampling. Unlike previous methods, ProFusion requires no regularization during training. Instead, it handles the overfitting problem at inference time through Fusion Sampling.
Fusion Sampling involves two stages at each timestep. The fusion stage encodes information from the input image embedding and the conditioning text, resulting in a noisy partial outcome. The refinement stage updates the prediction based on hyperparameters chosen by the user. This approach preserves fine-grained information from the input image while conditioning the output on the input prompt. It also saves training time and eliminates the need for hyperparameter tuning related to regularization methods.
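The two-stage control flow described above can be sketched as a toy loop. This only illustrates the structure (a fusion stage followed by a refinement stage at every timestep). The arithmetic inside each stage is a placeholder assumption, not ProFusion's actual update rule, and `toy_predict`, `two_stage_step`, `sample`, `guidance`, and `noise_scale` are hypothetical names introduced for this example.

```python
import random


def toy_predict(x, cond):
    """Hypothetical stand-in for a conditional noise predictor."""
    return [0.1 * (xi - ci) for xi, ci in zip(x, cond)]


def two_stage_step(x_t, image_emb, text_emb, rng, guidance=0.5, noise_scale=0.05):
    # Fusion stage (placeholder): condition on both the image embedding and
    # the text, take a step, and keep some noise -> a noisy partial outcome.
    fused = [0.5 * (i + t) for i, t in zip(image_emb, text_emb)]
    eps_fused = toy_predict(x_t, fused)
    x_partial = [
        x - e + noise_scale * rng.gauss(0.0, 1.0)
        for x, e in zip(x_t, eps_fused)
    ]

    # Refinement stage (placeholder): update the prediction, with a
    # user-chosen hyperparameter balancing image and text conditioning.
    eps_img = toy_predict(x_partial, image_emb)
    eps_txt = toy_predict(x_partial, text_emb)
    eps = [ei + guidance * (et - ei) for ei, et in zip(eps_img, eps_txt)]
    return [x - e for x, e in zip(x_partial, eps)]


def sample(image_emb, text_emb, steps=50, seed=0, guidance=0.5):
    """Run both stages at every timestep, starting from random noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in image_emb]
    for _ in range(steps):
        x = two_stage_step(x, image_emb, text_emb, rng, guidance=guidance)
    return x


image_emb = [1.0, 1.0, 1.0, 1.0]      # toy image embedding
text_emb = [-1.0, -1.0, -1.0, -1.0]   # toy text embedding
result = sample(image_emb, text_emb)
```

In this toy version, lowering `guidance` pulls the result toward the image embedding and raising it pulls the result toward the text embedding, which mirrors the trade-off the user-chosen hyperparameters are described as controlling.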
According to its authors, ProFusion outperforms other customization techniques at preserving fine-grained details, especially facial traits, and this regularization-free framework achieves state-of-the-art quality in customized text-to-image generation.
For more details, see the ProFusion paper and its GitHub repository.
By Daniele Lorenzi