In recent years, text-to-image (T2I) generation systems such as DALL-E 2, Imagen, CogView, and Latent Diffusion have advanced rapidly. Text-to-video (T2V) generation, however, remains challenging: a model must produce both high-quality visual content and realistic motion that matches the text, and large-scale text-video paired datasets are scarce.
Researchers at Baidu Inc. have introduced VideoGen, a T2V method that generates a high-quality video from a textual description. The pipeline first uses a pretrained T2I model to generate a high-quality reference image from the text. A cascaded latent video diffusion module then produces a sequence of smooth, high-resolution latent representations conditioned on both the reference image and the text description. When a higher frame rate is needed, a flow-based scheme raises the temporal resolution of the latent sequence. Finally, a trained video decoder maps the latent representations to the output video.
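VideoGen's components are not public, so here is a minimal sketch of how the four stages described above could fit together. Every function name, tensor shape, and placeholder body below is an assumption for illustration, not VideoGen's actual implementation:

```python
import torch

# Hypothetical stand-ins for the four stages; all names and shapes are
# assumptions, since the real VideoGen components are not released.

def t2i_reference_image(prompt: str) -> torch.Tensor:
    """Stage 1: a pretrained T2I model would return a reference image.
    Placeholder: a random 512x512 RGB tensor."""
    return torch.rand(3, 512, 512)

def cascaded_latent_video_diffusion(ref_image, prompt, num_frames=16):
    """Stage 2: produce latent frames conditioned on the reference image
    and the text prompt. Placeholder: random latents."""
    return torch.rand(num_frames, 4, 64, 64)

def flow_based_temporal_sr(latents, factor=2):
    """Stage 3 (optional): raise temporal resolution, e.g. by inserting
    flow-interpolated latents between neighbors. Placeholder: repeat."""
    return latents.repeat_interleave(factor, dim=0)

def video_decoder(latents):
    """Stage 4: map latent frames to RGB frames. Placeholder upsampling
    that keeps three channels and scales to pixel resolution."""
    return torch.nn.functional.interpolate(latents[:, :3], size=(512, 512))

prompt = "a corgi surfing a wave at sunset"   # example prompt, not from the paper
ref = t2i_reference_image(prompt)
latents = cascaded_latent_video_diffusion(ref, prompt)
latents = flow_based_temporal_sr(latents)     # optional frame-rate boost
video = video_decoder(latents)                # (T, 3, 512, 512)
print(video.shape)
```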
## The Benefits of Creating a Reference Image
- Improved visual quality: generating a reference image with a T2I model lets the video inherit the visual quality of models trained on large image-text datasets, which are far more diverse and information-rich than video-text datasets. It is also more efficient at training time than Imagen Video, which relies on joint training with image-text pairs.
- Learning video dynamics: conditioning the cascaded latent video diffusion model on a reference image frees it to concentrate on learning video dynamics rather than visual content, an advantage over methods that only reuse T2I model parameters (a minimal sketch of this conditioning idea follows the list).
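One common way to realize such conditioning, sketched below under assumed shapes, is to concatenate the reference image's latent to every noisy frame latent along the channel axis, so the denoiser mainly has to explain motion rather than content. The `denoiser` stub and all dimensions are assumptions, not VideoGen's exact design:

```python
import torch

T, C, H, W = 16, 4, 64, 64
noisy_video = torch.randn(T, C, H, W)   # noisy latent frames (assumed shape)
ref_latent = torch.randn(1, C, H, W)    # encoded reference image (assumed)

# Broadcast the single reference latent to every frame, then concatenate
# along channels: each frame "sees" the same visual-content anchor.
cond_input = torch.cat([noisy_video, ref_latent.expand(T, -1, -1, -1)], dim=1)

# Stand-in for the video denoising network (the real model is a U-Net).
denoiser = torch.nn.Conv3d(2 * C, C, kernel_size=3, padding=1)

# Conv3d expects (N, C, T, H, W), so move the frame axis next to batch.
pred_noise = denoiser(cond_input.permute(1, 0, 2, 3).unsqueeze(0))
print(pred_noise.shape)  # torch.Size([1, 4, 16, 64, 64])
```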
The researchers also note that their video decoder does not need the text description to map the latent representation sequence to a video. This lets them train the decoder on a larger dataset that includes both video-text pairs and unlabeled videos, improving the smoothness and realism of the generated motion.
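Because the decoder takes only latents as input, a plain reconstruction loss on unlabeled clips suffices to train it. The toy loop below illustrates that point; the encoder/decoder modules, shapes, and hyperparameters are assumptions, not the paper's setup:

```python
import torch

encoder = torch.nn.Conv2d(3, 4, kernel_size=8, stride=8)            # frame -> latent (stand-in)
decoder = torch.nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)   # latent -> frame (stand-in)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

for step in range(3):  # toy loop; real training iterates over a video corpus
    frames = torch.rand(16, 3, 256, 256)        # a clip of unlabeled video
    with torch.no_grad():
        latents = encoder(frames)               # (16, 4, 32, 32)
    recon = decoder(latents)
    loss = torch.nn.functional.mse_loss(recon, frames)  # no text anywhere
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: loss={loss.item():.4f}")
```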
Based on qualitative and quantitative evaluations, VideoGen represents a significant improvement over previous text-to-video generation methods.
To learn more about this research, check out the paper and project page. All credit for this research goes to the researchers behind the project.