The Field of Generative Models: A Breakthrough in Visual Synthesis
Generative models have recently drawn wide attention for their advances in visual synthesis. While prior work has shown that high-quality image generation is possible, generating videos has proven considerably harder. Practical applications, such as feature films, cartoons, and short clips on apps like TikTok, call for videos of very different lengths, from clips lasting seconds to films running 90 minutes or more.
To tackle this challenge, Microsoft’s research team has developed an innovative architecture for generating long videos. Unlike existing methods that generate videos frame by frame in sequence, their approach follows a coarse-to-fine process: a global diffusion model first generates keyframes spanning the whole video, and local diffusion models then fill in the material between adjacent frames. Because all frames at a given level of granularity are independent of one another, they can be generated at the same time.
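To make the coarse-to-fine idea concrete, here is a toy sketch (names and numbers are ours, not the paper's) that computes which frame indices a global keyframe pass would produce first, and how successive local passes would then fill the gaps between known frames:

```python
def coarse_to_fine_schedule(n_frames: int, keyframes: int = 5) -> list[list[int]]:
    """Return generation levels: level 0 is the global keyframe pass,
    later levels are local infilling passes between known frames."""
    # Global pass: evenly spaced keyframes across the whole video.
    step = (n_frames - 1) / (keyframes - 1)
    known = sorted({round(i * step) for i in range(keyframes)})
    levels = [list(known)]
    # Local passes: repeatedly insert midpoints between adjacent known frames.
    while len(known) < n_frames:
        new = []
        for a, b in zip(known, known[1:]):
            mid = (a + b) // 2
            if mid not in (a, b):
                new.append(mid)
        if not new:
            break
        known = sorted(set(known) | set(new))
        levels.append(new)
    return levels

levels = coarse_to_fine_schedule(n_frames=17, keyframes=5)
print(levels[0])   # keyframes generated first by the global model
print(levels[1:])  # frames filled in afterwards by local models
```

Note that every frame inside one level can be produced without waiting on any other frame in that level, which is the property the architecture exploits.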
The team’s architecture, called NUWA-XL, is the first model trained directly on long videos, closing the gap between the data it trains on and the videos it is asked to generate. The design also permits parallel inference, significantly reducing the time required to generate lengthy videos: NUWA-XL accelerates inference by 94.26% when producing 1024 frames.
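The parallel-inference claim follows from the structure above: once the keyframes exist, the segments between adjacent keyframes are mutually independent, so each local infilling call can run concurrently. A minimal illustration, where `fill_segment` is a hypothetical stand-in for one local diffusion call:

```python
from concurrent.futures import ThreadPoolExecutor

def fill_segment(start: int, end: int) -> list[int]:
    # Stand-in for a local diffusion model generating the frames
    # strictly between two already-known keyframes.
    return list(range(start + 1, end))

keyframes = [0, 256, 512, 768, 1023]  # e.g. spanning 1024 frames
segments = list(zip(keyframes, keyframes[1:]))

# Each segment depends only on its two bracketing keyframes,
# so the infilling calls can be dispatched in parallel.
with ThreadPoolExecutor() as pool:
    filled = list(pool.map(lambda seg: fill_segment(*seg), segments))

# Assemble the full frame set: keyframes plus the filled-in spans.
frames = sorted(set(keyframes) | {f for span in filled for f in span})
print(len(frames))
```

In a real system each worker would be a GPU-bound diffusion call rather than a cheap Python function, but the dependency structure, and hence the speedup over strictly sequential generation, is the same.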
To evaluate the model and set a standard for long-video generation, the research team built a new dataset called FlintstonesHD, which serves as a benchmark for NUWA-XL.
The team also developed two components to make the architecture practical: Temporal KLVAE (T-KLVAE) and Masked Temporal Diffusion (MTD). T-KLVAE compresses input frames into a low-dimensional latent representation, reducing the computational burden of diffusion. MTD is the shared diffusion backbone: a single model that handles both global diffusion, where keyframes are generated from scratch, and local diffusion, where generation is conditioned on the surrounding frames, improving the overall quality of the generated video.
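One way to picture how a single MTD backbone can serve both roles is as a conditioning mask over the frames in a segment. The sketch below is our own simplification, not the paper's implementation: it only marks which positions are supplied as conditions and which are to be denoised.

```python
def conditioning_mask(n_frames: int, mode: str) -> list[int]:
    """1 marks a frame supplied as a condition, 0 a frame to be generated.

    Toy illustration of masked temporal conditioning: the global pass
    conditions on nothing, while a local pass is conditioned on the two
    bracketing keyframes of its segment.
    """
    mask = [0] * n_frames
    if mode == "local":
        mask[0] = mask[-1] = 1
    elif mode != "global":
        raise ValueError(f"unknown mode: {mode}")
    return mask

print(conditioning_mask(8, "global"))  # nothing conditioned
print(conditioning_mask(8, "local"))   # endpoints conditioned
```

Because the difference between the two passes reduces to the mask, the same network weights can be trained and reused for both, which is what lets one backbone cover the whole coarse-to-fine hierarchy.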
While NUWA-XL shows promising results, some limitations remain. The model has only been validated on publicly available cartoons, and its effectiveness on open-domain long videos, such as movies and TV episodes, has yet to be demonstrated. Training on long open-domain footage is also difficult because of data scarcity. Lastly, NUWA-XL's improved inference speed depends on powerful graphics processing units (GPUs) for parallel processing.
Overall, the research team’s development of NUWA-XL and its associated methods represents a significant advancement in the field of generative models. By tackling the challenge of generating long videos, they have paved the way for new possibilities in visual synthesis. To learn more about their research, including the detailed methods and results, you can check out their paper and project.