**Introducing MotionDirector: Customizing Motion in Text-to-Video Generation**
Text-to-video diffusion models have advanced rapidly in recent years: users can now create realistic or imaginative videos simply by providing textual descriptions. These models have also been fine-tuned to generate content that matches specific appearances, styles, and subjects. However, customizing motion in text-to-video generation remains underexplored. Users often want to create videos with specific motions, such as a car moving forward and then turning left, which is why adapting diffusion models to produce content tailored to users' preferences is important.
The authors of this paper have proposed MotionDirector, a technique that helps foundation models achieve motion customization while maintaining appearance diversity. MotionDirector uses a dual-path architecture to train models to learn the appearance and motions separately from single or multiple reference videos. This makes it easy to apply the customized motion to different settings.
The dual-path architecture consists of a spatial pathway and a temporal pathway. The spatial pathway is the foundation model with trainable spatial LoRAs (low-rank adaptations) injected into its spatial transformer layers for each video. These spatial LoRAs capture the visual attributes of the input videos and are trained on a randomly selected single frame in each training step. The temporal pathway, in turn, duplicates the foundation model and shares the spatial LoRAs with the spatial pathway so that it adapts to the appearance of the input videos. In addition, the temporal transformers in this pathway are equipped with temporal LoRAs, which are trained on multiple frames of the input videos to capture the underlying motion patterns.
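To make the LoRA mechanism concrete, here is a minimal sketch of a low-rank adapted linear layer, the kind of building block MotionDirector injects into the transformer layers. The class name, shapes, and initialization below are illustrative assumptions for exposition, not the authors' code:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank residual alpha * (B @ A).

    Hypothetical sketch: only A and B would be updated during fine-tuning,
    while W (the foundation model's weight) stays frozen.
    """

    def __init__(self, d_out, d_in, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))        # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init
        self.alpha = alpha

    def __call__(self, x):
        # Base path plus scaled low-rank path. Because B starts at zero,
        # the adapted layer initially reproduces the frozen model exactly.
        return self.W @ x + self.alpha * (self.B @ (self.A @ x))

layer = LoRALinear(d_out=8, d_in=8, rank=2)
x = np.ones(8)
assert np.allclose(layer(x), layer.W @ x)  # zero-init LoRA is a no-op
```

The appeal of this design is that the rank-`r` residual adds only `r * (d_in + d_out)` trainable parameters per layer, which is why separate spatial and temporal LoRA sets can be trained cheaply per reference video.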
By deploying only the trained temporal LoRAs, the foundation model can synthesize videos with the learned motions and diverse appearances. Because the dual-path architecture lets the model learn the appearance and motion of objects separately, MotionDirector can isolate them and recombine appearance and motion from different source videos.
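The selective deployment described above can be sketched as toggling which low-rank residuals are merged into a layer's effective weight at inference time. The weights below are toy stand-ins, not trained parameters, and the helper name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))            # frozen base weight of one layer
spatial_BA = rng.standard_normal((4, 4))   # stand-in for a trained spatial LoRA residual (appearance)
temporal_BA = rng.standard_normal((4, 4))  # stand-in for a trained temporal LoRA residual (motion)

def adapted_weight(use_spatial, use_temporal):
    # Effective weight = frozen base plus whichever low-rank residuals are enabled.
    return W + (spatial_BA if use_spatial else 0) + (temporal_BA if use_temporal else 0)

# Motion customization with diverse appearance: drop the spatial LoRAs
# (appearance comes from the base model and the text prompt) and keep
# only the temporal LoRAs (the learned motion).
W_motion_only = adapted_weight(use_spatial=False, use_temporal=True)
assert np.allclose(W_motion_only, W + temporal_BA)
```

Keeping both residual sets would instead reproduce the reference video's appearance along with its motion, which is how the same trained adapters support mixing appearance and motion from different sources.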
The researchers evaluated MotionDirector on benchmarks covering a variety of motions and text prompts. On the UCF Sports Action benchmark, human raters preferred MotionDirector for its motion fidelity, and it outperformed the base models on the LOVEU-TGVE-2023 benchmark as well. These results demonstrate MotionDirector's ability to customize multiple base models and produce videos with diverse appearances and the desired motion concepts.
MotionDirector is a promising method for adapting text-to-video diffusion models to generate videos with specific motions. It excels in learning and adapting specific motions of subjects and cameras, and it can generate videos with a wide range of visual styles.
While there is room for improvement in learning the motion of multiple subjects in reference videos, MotionDirector has the potential to enhance flexibility in video generation. It allows users to create videos that are tailored to their preferences and requirements.
Check out the Paper, Project, and GitHub for more information on MotionDirector. All credit for this research goes to the researchers on this project.