Google Research has developed VideoPoet, a video generation model that uses a large language model (LLM) to generate videos from a variety of inputs, including text, images, and audio. VideoPoet stands out because it performs many kinds of video generation within a single model, in contrast to approaches that rely on separately trained components for each task.
The Model’s Capabilities
VideoPoet handles multiple tasks within one model, including text-to-video, image-to-video, video-to-audio, stylization, and outpainting. From a text prompt alone, it can generate videos spanning a wide range of styles and motions, as sketched below.
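One way a single autoregressive model can serve all of these tasks is to express each task as a different arrangement of conditioning tokens placed ahead of the tokens to be predicted. The sketch below is an illustrative assumption about such a layout; the task markers (`<t2v>`, `<i2v>`, `<v2a>`) and the `build_prefix` helper are hypothetical, not VideoPoet's actual token format.

```python
def build_prefix(task: str, text_tokens=None, image_tokens=None, video_tokens=None):
    """Assemble the conditioning prefix for a given task.

    The model then continues the prefix with the target tokens
    (video tokens for generation tasks, audio tokens for video-to-audio).
    """
    if task == "text_to_video":
        return ["<t2v>"] + list(text_tokens)
    if task == "image_to_video":
        # The input image's tokens act as the first frame's conditioning.
        return ["<i2v>"] + list(image_tokens)
    if task == "video_to_audio":
        # Video tokens in; audio tokens are predicted after the prefix.
        return ["<v2a>"] + list(video_tokens)
    raise ValueError(f"unknown task: {task}")
```

Under this scheme, adding a task does not require a new model, only a new prefix layout over the shared token vocabulary.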
Using Language Models for Training
An advantage of building VideoPoet on an LLM is that it can reuse the scalability and efficiency improvements of existing LLM training infrastructure. LLMs operate on discrete tokens, which poses a challenge for continuous signals like video; video and audio tokenizers bridge this gap by encoding clips as sequences of discrete tokens, which can later be decoded back into video and audio.
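The following is a minimal, runnable sketch of that pipeline: quantize pixels into discrete tokens, run a language model over the tokens, and decode the result back into frames. All class names, the toy nearest-neighbor quantizer, and the uniform sampler are simplifying assumptions standing in for VideoPoet's learned tokenizer and transformer.

```python
import numpy as np

class ToyVideoTokenizer:
    """Quantizes pixels to their nearest codebook entry (a toy stand-in
    for a learned video tokenizer)."""

    def __init__(self, codebook_size: int = 256, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Each code is a scalar here; a real tokenizer uses learned embeddings.
        self.codebook = rng.uniform(0.0, 1.0, size=codebook_size)

    def encode(self, frames: np.ndarray) -> np.ndarray:
        # Map every pixel to the index of the closest codebook value.
        flat = frames.reshape(-1)
        return np.abs(flat[:, None] - self.codebook[None, :]).argmin(axis=1)

    def decode(self, tokens: np.ndarray, shape: tuple) -> np.ndarray:
        # Invert tokenization: look up each token's codebook value.
        return self.codebook[tokens].reshape(shape)

class ToyLM:
    """Stand-in for the autoregressive transformer: given a token prefix,
    emit a continuation over the same discrete vocabulary."""

    def __init__(self, vocab_size: int, seed: int = 0):
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)

    def generate(self, prefix: np.ndarray, num_tokens: int) -> np.ndarray:
        # A real model samples from learned next-token distributions;
        # here we sample uniformly to keep the sketch self-contained.
        return self.rng.integers(0, self.vocab_size, size=num_tokens)

# Usage: text tokens form the prefix, the model continues with video
# tokens, and the tokenizer decodes those tokens back into frames.
tok = ToyVideoTokenizer()
lm = ToyLM(vocab_size=256)
frame_shape = (4, 8, 8)            # four tiny 8x8 grayscale frames
prompt = np.array([1, 2, 3])       # pretend text-prompt tokens
video_tokens = lm.generate(prompt, num_tokens=int(np.prod(frame_shape)))
frames = tok.decode(video_tokens, frame_shape)
```

The point of the sketch is only that every modality is reduced to one shared discrete vocabulary, so the same next-token machinery used for text applies unchanged to video and audio.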
Examples Generated by VideoPoet
Examples generated by VideoPoet include videos produced from specific text prompts, along with image-to-video and video stylization results. The model can also generate audio from video.
Long Video Generation and Interactive Editing
VideoPoet can generate longer videos by repeatedly conditioning on the last one second of generated video and predicting the next one second. It also supports interactive editing of existing video clips and motion control of both the camera and the subject.
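Below is a sketch of that extension loop under the same assumptions as the earlier example: `lm` is the hypothetical token-level model, and `tokens_per_second` (the number of tokens representing one second of video) is an illustrative parameter, not a documented VideoPoet value.

```python
import numpy as np

def extend_video(lm, seed_tokens, tokens_per_second: int, seconds: int):
    """Grow a video token sequence one second at a time by conditioning
    on the most recent second and predicting the next one."""
    video = list(seed_tokens)
    for _ in range(seconds):
        # Use only the last second of tokens as context...
        context = np.array(video[-tokens_per_second:])
        # ...and ask the model for the tokens of the following second.
        next_chunk = lm.generate(context, num_tokens=tokens_per_second)
        video.extend(int(t) for t in next_chunk)
    return np.array(video)

# Usage with the toy components from the earlier sketch:
# long_tokens = extend_video(lm, video_tokens, tokens_per_second=256, seconds=8)
```

Because each step conditions only on the most recent second, the loop can in principle run for any number of iterations, which is what allows the model to produce videos longer than its training clips.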
In conclusion, VideoPoet is a versatile video generation model: a single LLM that covers tasks from text-to-video and stylization through video-to-audio, while producing high-quality videos across all of them.