Probabilistic diffusion models are a class of generative model that has become central to computer vision research. Unlike other generative models, such as variational autoencoders (VAEs), generative adversarial networks (GANs), and vector-quantized approaches, diffusion models generate data in a distinctive way: a fixed Markov chain gradually maps between the data distribution and a latent noise distribution, which helps capture the complex structure of a dataset. These models have shown impressive capabilities in generating detailed and diverse images, driving advances in computer vision tasks such as image synthesis, editing, translation, and text-to-video generation.
Diffusion models consist of two main components: the diffusion (forward) process and the denoising (reverse) process. In the diffusion process, Gaussian noise is gradually added to the input data until it becomes pure Gaussian noise. The denoising process then recovers the original data from its noisy state through a series of learned inverse diffusion steps, with a U-Net typically trained to predict the noise to remove at each step.
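The forward process described above has a convenient closed form: a noisy sample at any timestep can be drawn in one shot from the clean input. The sketch below illustrates this with NumPy; the function name `forward_diffusion` and the linear noise schedule are illustrative choices, not part of any specific library.

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) in closed form for a DDPM-style schedule:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    rng = rng or np.random.default_rng(0)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]      # cumulative signal-retention factor
    eps = rng.standard_normal(x0.shape)    # the Gaussian noise being added
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

# Illustrative linear beta schedule over T = 1000 steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
x0 = np.ones((8, 8))                       # toy "image"
xt, eps = forward_diffusion(x0, T - 1, betas)
# At the final step alpha_bar is near zero, so x_T is essentially pure noise
```

During training, the denoising U-Net would be given `xt` and `t` and asked to predict `eps`; at inference it runs this mapping in reverse, step by step.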
Most research on diffusion models focuses on applying pre-trained diffusion U-Nets to different tasks, without examining the internal characteristics of the diffusion U-Net itself. However, a study from S-Lab at Nanyang Technological University takes a different approach: it investigates the effectiveness of the diffusion U-Net in the denoising process by analyzing the generation process in the Fourier domain.
The study found that low-frequency components are modulated gradually during denoising, while high-frequency components change more dynamically. Low-frequency components encode an image's global structure and characteristics, while high-frequency components capture rapid variations such as edges and textures. The denoising process must therefore remove noise without destroying these important details.
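A frequency-domain analysis like the one described can be reproduced with a simple FFT-based split of an image into low- and high-frequency bands. The function name `split_frequencies` and the radial cutoff value below are illustrative assumptions, not the study's exact procedure.

```python
import numpy as np

def split_frequencies(img, cutoff=0.25):
    """Split a 2-D image into low- and high-frequency parts using a
    circular mask in the Fourier domain; cutoff is a fraction of the
    maximum radius (an illustrative choice)."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)     # distance from the spectrum center
    mask = r <= cutoff * (min(h, w) / 2)     # True inside the low-frequency disk
    low = np.fft.ifft2(np.fft.ifftshift(F * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(F * ~mask)).real
    return low, high

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
low, high = split_frequencies(img)
# The two bands partition the spectrum, so low + high reconstructs img
```

Tracking how the energy in each band evolves across denoising steps is one way to observe the gradual low-frequency modulation and the more volatile high-frequency changes the study reports.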
To understand the specific contributions of the U-Net architecture in the diffusion framework, the study shows that the primary backbone of the U-Net plays a significant role in denoising, while skip connections introduce high-frequency features that help recover fine-grained semantic information. However, these high-frequency features can sometimes lead to abnormal image details during the inference phase.
To address this issue, the researchers propose a new approach called “FreeU.” This approach introduces modulation factors to balance the contributions of the primary backbone and skip connections. These factors enhance the denoising process while preventing over-smoothing of textures. The FreeU framework can be integrated with existing diffusion models and has been shown to improve the quality of generated outputs in tasks like text-to-image and text-to-video generation.
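The idea of rebalancing the two feature streams can be sketched as follows: amplify part of the backbone's feature channels by one factor and attenuate the skip connection's low-frequency band in the Fourier domain by another. This is a minimal sketch under stated assumptions, not the authors' implementation; the function name `freeu_merge` and the factor values are hypothetical, and FreeU itself operates inside a pre-trained U-Net's decoder rather than on standalone arrays.

```python
import numpy as np

def freeu_merge(backbone, skip, b=1.2, s=0.9, cutoff=0.25):
    """Sketch of a FreeU-style merge of decoder features.

    backbone, skip: (C, H, W) feature maps. Half of the backbone channels
    are amplified by b; the skip features' low-frequency band is scaled by
    s in the Fourier domain before the two are concatenated. All factor
    values here are illustrative."""
    C, H, W = backbone.shape
    out_b = backbone.copy()
    out_b[: C // 2] *= b                     # strengthen backbone denoising

    # Spectral modulation: damp the skip features' low frequencies
    F = np.fft.fftshift(np.fft.fft2(skip, axes=(-2, -1)), axes=(-2, -1))
    yy, xx = np.mgrid[:H, :W]
    r = np.hypot(yy - H / 2, xx - W / 2)
    scale = np.where(r <= cutoff * (min(H, W) / 2), s, 1.0)
    out_s = np.fft.ifft2(np.fft.ifftshift(F * scale, axes=(-2, -1)),
                         axes=(-2, -1)).real
    return np.concatenate([out_b, out_s], axis=0)

rng = np.random.default_rng(0)
backbone = rng.standard_normal((4, 16, 16))
skip = rng.standard_normal((4, 16, 16))
merged = freeu_merge(backbone, skip)
```

Because only two scalar factors are introduced per decoder stage, this kind of rebalancing adds no trainable parameters, which is why FreeU can be dropped into existing diffusion models at inference time.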
The researchers conducted comprehensive experiments using foundational models for benchmarking, and the results showed a significant enhancement in the quality of generated outputs when FreeU was applied during the inference phase.
In conclusion, FreeU is a novel AI technique that improves the output quality of diffusion-based generative models without any additional training or fine-tuning. It has been shown to enhance both intricate details and the overall visual fidelity of generated images. If you want to learn more about FreeU, you can check out the links provided in the article.