Large Language Models (LLMs) have made significant progress in the Natural Language Processing (NLP) field. To further extend their capabilities, researchers have developed Multimodal Large Language Models (MLLMs) that can perceive and interpret multiple modalities. MLLMs have shown impressive abilities in tasks such as visual perception, commonsense reasoning, and code reasoning, bringing a more human-like perspective and a wider range of task-solving skills than unimodal models.
Existing vision-centric MLLMs combine components such as a Q-former, a pre-trained LLM, and a visual encoder to process multimodal data. Another approach connects existing visual perception tools to LLMs through APIs, building a system that requires no additional training. However, previous studies have neither explored how well these systems handle lengthy videos nor provided criteria for evaluating them on such content.
In this study, researchers from Zhejiang University, University of Washington, Microsoft Research Asia, and Hong Kong University introduce MovieChat, a framework that combines vision models with LLMs to interpret lengthy videos. To address challenges such as computational cost, memory consumption, and long-term temporal connections, they propose a memory mechanism based on the Atkinson-Shiffrin model of human memory.
The MovieChat framework enables long-video comprehension tasks and achieves state-of-the-art performance. Its memory mechanism, inspired by the Atkinson-Shiffrin model, represents short-term and long-term memory using tokens in Transformers. By capturing long-term temporal relationships while keeping memory usage and computational complexity low, MovieChat improves video comprehension and has practical applications in content analysis, video recommendation systems, and video surveillance.
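To make the two-tier memory idea concrete, here is a minimal sketch (not the authors' code) of a fixed-size short-term buffer of frame features that, when full, is flushed into a long-term store; the long-term store stays within budget by repeatedly merging the most similar adjacent frames. The capacities, averaging-based merge rule, and class name are illustrative assumptions.

```python
import numpy as np

class MemoryBuffer:
    """Illustrative two-tier memory sketch: short-term buffer of recent
    frame features plus a compressed long-term store. All sizes and the
    merge rule are assumptions, not the paper's exact implementation."""

    def __init__(self, short_capacity=8, long_capacity=64):
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity
        self.short_term = []  # recent frame feature vectors
        self.long_term = []   # consolidated, compressed history

    def add_frame(self, feature):
        self.short_term.append(np.asarray(feature, dtype=float))
        if len(self.short_term) > self.short_capacity:
            self._consolidate()

    def _consolidate(self):
        # Flush short-term contents into long-term memory, then merge
        # the most similar adjacent pair until the budget is met.
        self.long_term.extend(self.short_term)
        self.short_term = []
        while len(self.long_term) > self.long_capacity:
            sims = [self._cos(self.long_term[i], self.long_term[i + 1])
                    for i in range(len(self.long_term) - 1)]
            i = int(np.argmax(sims))  # most redundant adjacent pair
            merged = (self.long_term[i] + self.long_term[i + 1]) / 2.0
            self.long_term[i:i + 2] = [merged]

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

Because each merge averages the most similar neighboring frames, the long-term store preserves scene transitions while collapsing redundant stretches of video, which is what keeps memory cost bounded for arbitrarily long inputs.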
Future research could focus on strengthening the memory mechanism and incorporating additional modalities, such as audio, to further enhance video comprehension. Overall, MovieChat opens up possibilities for applications that require a comprehensive understanding of long-form visual data.
Check out the Paper, GitHub, and Project for more information on this research.