Visual Captions: Enhancing Video Communication with Real-Time Visuals
Recent advances in video conferencing, such as live captioning and noise cancellation, have significantly improved remote communication. Still, there are scenarios where dynamic visual augmentation could make complex information easier to convey. For example, in a conversation about ordering food at a Japanese restaurant, visual aids can help participants feel more confident about their choices; similarly, when sharing experiences such as a family trip to San Francisco, showing personal photos can enrich the conversation.
Visual Captions: Augmenting Verbal Communication:
Presented at ACM CHI 2023, Visual Captions is a system that uses verbal cues to augment video communication with real-time visuals. It relies on a large language model that proactively suggests relevant visuals in open-vocabulary conversations. Visual Captions has been released as part of the ARChat project, which enables rapid prototyping of augmented communication with real-time transcription.
Design Space for Dynamic Visual Augmentation:
To understand the needs of potential users, 10 participants from varied backgrounds were invited to discuss their requirements for a real-time visual augmentation service. These discussions surfaced eight dimensions of visual augmentation in conversations: temporal aspects, expressing versus understanding speech content, variety of visual content, scale of meetings, settings of meetings, privacy preferences, levels of initiation, and modes of interaction.
Collection of a Specific Training Set:
To capture the contextual needs of conversations, a dedicated training set was collected. The dataset comprises 1,595 language-visual pairs covering contexts such as daily conversations, lectures, and travel guides.
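To make the shape of such a dataset concrete, here is a hypothetical language-visual pair in the spirit of the description above; the field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# A hypothetical language-visual pair; field names are illustrative,
# not the published schema of the 1,595-pair dataset.
example_pair = {
    "speech": "We hiked across the Golden Gate Bridge on our trip.",
    "visual_content": "Golden Gate Bridge",
    "visual_source": "personal album",
    "visual_type": "photo",
}

def pair_to_target(pair):
    """Serialize a pair into the short text an LLM could be trained
    to emit: '<type> of <content> from <source>'."""
    return (f"{pair['visual_type']} of {pair['visual_content']}"
            f" from {pair['visual_source']}")
```

Pairing each utterance with a compact textual target like this lets a standard text-to-text language model learn the mapping without any custom output head.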
Visual Intent Prediction Model:
A visual intent prediction model was trained using a large language model and the specific training dataset. This model can handle open-vocabulary conversations and predict visual content, visual source, and visual type based on contextual cues.
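As a sketch of how such a prediction might be framed for a large language model, the snippet below assembles a few-shot prompt from training pairs and a rolling window of recent transcript text. The function name, prompt template, and window size are illustrative assumptions, not the paper's actual implementation.

```python
def build_visual_prompt(examples, transcript, window=3):
    """Assemble a few-shot prompt asking an LLM to predict the visual
    content, source, and type for the latest speech (illustrative
    template, not the published one)."""
    shots = "\n\n".join(
        f"Speech: {speech}\nVisual: {visual}" for speech, visual in examples
    )
    # Keep only a short rolling window of the most recent utterances.
    context = " ".join(transcript[-window:])
    return f"{shots}\n\nSpeech: {context}\nVisual:"

examples = [
    ("I went to Tokyo last summer.",
     "photo of Tokyo from image search"),
    ("Mitochondria produce most of the cell's energy.",
     "diagram of a mitochondrion from image search"),
]
prompt = build_visual_prompt(examples, ["Let's order sushi.", "What is unagi?"])
```

The trailing `Visual:` cue leaves the model to complete the line with its predicted visual content, source, and type.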
The utility of the Visual Captions model was evaluated with 89 participants performing 846 tasks. Feedback indicated that visuals were preferred during conversations, considered useful, informative, and relevant to the original speech. Participants also rated the predicted visual type and source as accurate in context.
Visual Captions was developed as an interactive widget for video conferencing platforms such as Google Meet. It operates by capturing the user's speech, predicting candidate visuals every 100 ms, retrieving relevant visuals, and suggesting them in real time.
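The pipeline described above can be sketched as a simple loop. Everything here is a minimal illustration under stated assumptions: the four callables (`transcribe`, `predict_visual`, `retrieve`, `suggest`) are placeholders, not the system's published API.

```python
import time
from collections import deque

PREDICT_INTERVAL = 0.1  # the system re-runs prediction every 100 ms

def caption_loop(transcribe, predict_visual, retrieve, suggest, running):
    """Skeleton of the real-time pipeline: accumulate transcribed speech,
    predict a visual intent, retrieve matching visuals, and surface
    suggestions. All callables are hypothetical placeholders."""
    transcript = deque(maxlen=50)  # bounded rolling transcript
    last_intent = None
    while running():
        text = transcribe()          # latest speech fragment, if any
        if text:
            transcript.append(text)
        intent = predict_visual(" ".join(transcript))
        if intent and intent != last_intent:  # avoid duplicate suggestions
            suggest(retrieve(intent))
            last_intent = intent
        time.sleep(PREDICT_INTERVAL)
```

Deduplicating on the predicted intent keeps the 100 ms cadence from flooding the user with identical suggestions while speech is still in progress.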
Levels of Proactivity:
Visual Captions offers three levels of proactivity in suggesting visuals: auto-display, auto-suggest, and on-demand-suggest. These levels allow users to adjust the system’s involvement based on their preferences and the social scenario.
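A minimal sketch of how these three levels could gate a predicted visual is shown below; the enum names and the dispatch function are my assumptions about one plausible structure, not the actual implementation.

```python
from enum import Enum

class Proactivity(Enum):
    """The three proactivity levels described above."""
    AUTO_DISPLAY = "auto-display"      # show predicted visuals immediately
    AUTO_SUGGEST = "auto-suggest"      # surface candidates; user confirms
    ON_DEMAND = "on-demand-suggest"    # suggest only when the user asks

def handle_prediction(level, user_requested=False):
    """Decide what to do with a predicted visual under each level.
    Returns 'display', 'suggest', or None (a simplified sketch)."""
    if level is Proactivity.AUTO_DISPLAY:
        return "display"
    if level is Proactivity.AUTO_SUGGEST:
        return "suggest"
    return "suggest" if user_requested else None
```

Separating the prediction step from this gating step means the same model can serve all three modes, with only the user-facing behavior changing.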
Visual Captions was evaluated in controlled lab studies and in-the-wild deployment studies. Participants found that real-time visuals improved conversations by clarifying concepts, resolving language ambiguities, and increasing engagement. Different levels of proactivity were preferred in different social scenarios.
Conclusion and Future Directions:
In conclusion, Visual Captions provides a system for real-time visual augmentation of video communication, making conversations more engaging and informative. Future directions include further enhancing the system's capabilities based on user feedback and expanding its compatibility with additional video conferencing platforms.