Movies have always been a powerful medium for telling stories and conveying emotion. In “The Pursuit of Happyness,” for example, the protagonist moves through dramatic highs and lows, and the audience feels deeply connected to his journey. For artificial intelligence (AI) to engage with stories at this level, machines must be able to recognize and analyze the emotions and mental states of characters in movies. This is where the MovieGraphs dataset and purpose-built AI models come into play.
Emotions have been studied extensively throughout history. From Cicero’s ancient classifications to modern brain research, humans have long been fascinated by emotions. Psychologists have developed frameworks such as Plutchik’s wheel and Ekman’s universal facial expressions to understand them better. Emotions are commonly treated as mental states that involve behavioral, cognitive, and bodily components. Recent efforts like the Emotic project have introduced clusters of emotion labels alongside continuous dimensions such as valence, arousal, and dominance, to capture the complexity of emotions more faithfully.
To predict a wide range of emotions accurately, AI systems need to draw on several contextual modalities. One line of work, Emotion Recognition in Conversations (ERC), categorizes emotions based on dialogue exchanges. Another predicts continuous valence-arousal scores for short movie clips. A more comprehensive analysis, however, can be done at the scene level, where multiple shots come together to tell a sub-story. Operating on whole scenes allows a deeper understanding of the emotions and mental states of every character involved.
In this study, the researchers propose EmoTx, a Transformer-based model that combines video frames, dialogue, and individual character appearances to predict emotions. The pipeline first pre-processes the data and extracts features from the video, character faces, and text. These features are projected through linear layers into a shared embedding space and fed into a Transformer encoder, which integrates information across the different modalities. The classification component is inspired by previous work on multi-label classification with Transformers.
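To make the architecture concrete, here is a minimal PyTorch sketch of an EmoTx-style model. It is an illustration under assumptions, not the authors’ implementation: the class name EmoTxSketch, the feature dimensions (d_video, d_face, d_text), the number of emotion labels, and all layer sizes are hypothetical placeholders. What it demonstrates is the structure described above: per-modality linear projections into a shared space, a Transformer encoder that mixes the tokens, and one learnable classifier token per emotion label, in the spirit of Transformer-based multi-label classification.

```python
import torch
import torch.nn as nn

class EmoTxSketch(nn.Module):
    """Minimal sketch of an EmoTx-style multimodal Transformer.

    All dimensions are hypothetical: video, face, and text features are
    assumed to come from pretrained backbones and are projected to a
    shared width d_model.
    """

    def __init__(self, num_emotions=25, d_model=256,
                 d_video=2048, d_face=512, d_text=768,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Per-modality linear projections into a shared embedding space.
        self.proj_video = nn.Linear(d_video, d_model)
        self.proj_face = nn.Linear(d_face, d_model)
        self.proj_text = nn.Linear(d_text, d_model)
        # Learnable modality embeddings so the encoder can tell token
        # types apart (video vs. face vs. dialogue vs. classifier).
        self.mod_emb = nn.Embedding(4, d_model)
        # One learnable classifier token per emotion label.
        self.cls_tokens = nn.Parameter(torch.randn(num_emotions, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # A shared linear head turns each classifier token into a logit.
        self.head = nn.Linear(d_model, 1)

    def forward(self, video_feats, face_feats, text_feats):
        # video_feats: (B, Tv, d_video), face_feats: (B, Tf, d_face),
        # text_feats: (B, Tt, d_text)
        B = video_feats.size(0)
        v = self.proj_video(video_feats) + self.mod_emb.weight[0]
        f = self.proj_face(face_feats) + self.mod_emb.weight[1]
        t = self.proj_text(text_feats) + self.mod_emb.weight[2]
        cls = (self.cls_tokens.unsqueeze(0).expand(B, -1, -1)
               + self.mod_emb.weight[3])
        # Concatenate classifier tokens with all modality tokens and let
        # self-attention integrate evidence across modalities.
        x = self.encoder(torch.cat([cls, v, f, t], dim=1))
        # Read each emotion's logit off its own classifier token.
        return self.head(x[:, :cls.size(1)]).squeeze(-1)  # (B, num_emotions)

model = EmoTxSketch()
logits = model(torch.randn(2, 10, 2048),   # video features
               torch.randn(2, 16, 512),    # face features
               torch.randn(2, 8, 768))     # dialogue features
probs = torch.sigmoid(logits)              # independent per-label scores
```

Because a scene can express several emotions at once, the logits would typically be trained with an independent per-label objective such as nn.BCEWithLogitsLoss, rather than a softmax over mutually exclusive classes.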
Overall, EmoTx offers a novel, Transformer-based approach to predicting emotions in movie scenes. The researchers demonstrate its effectiveness on a scene from “Forrest Gump.” If you’re interested in learning more, refer to the links provided in the article.
Source: https://arxiv.org/abs/2304.05634