Automatic speech recognition (ASR) technology has made conversations more accessible with live captions in remote conferencing software, mobile applications, and head-worn displays. However, these live caption systems often display interim predictions that are updated as new utterances are received, causing text instability. This can distract users, make them tired, and make it difficult for them to follow the conversation. In a study presented at ACM CHI 2023, we formalize the problem of text stability and propose solutions.
To quantify text instability, we developed a flicker metric based on luminance contrast and the discrete Fourier transform. The metric estimates the flicker in a live caption video from the difference in luminance between frames. Converting these luminance changes to frequencies lets us detect both obvious and subtle changes in the captions.
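As a rough illustration of the idea, the sketch below computes a flicker score from a stack of video frames. It is not the metric from the study: the function names `luminance` and `flicker_score`, the Rec. 709 luminance weights, the use of consecutive-frame differences, and the choice to average the non-DC spectral magnitudes are all assumptions made for this example.

```python
import numpy as np

def luminance(frames):
    """Convert RGB frames of shape (T, H, W, 3), values in [0, 1],
    to per-pixel relative luminance of shape (T, H, W).
    The Rec. 709 weights are an assumption for this sketch."""
    weights = np.array([0.2126, 0.7152, 0.0722])
    return frames @ weights

def flicker_score(frames):
    """Hypothetical flicker score: higher means more luminance change over time."""
    lum = luminance(np.asarray(frames, dtype=float))
    # Per-pixel change in luminance between consecutive frames: (T-1, H, W).
    delta = np.diff(lum, axis=0)
    # Magnitude spectrum of those changes along the time axis.
    spectrum = np.abs(np.fft.rfft(delta, axis=0))
    # Drop the zero-frequency (DC) bin and average the remaining magnitudes.
    return float(spectrum[1:].mean())

# A static clip versus one that alternates between black and white frames.
static_clip = np.zeros((8, 4, 4, 3))
flashing_clip = np.zeros((8, 4, 4, 3))
flashing_clip[1::2] = 1.0

static_score = flicker_score(static_clip)
flashing_score = flicker_score(flashing_clip)
```

In this toy setup the static clip scores zero and the flashing clip scores above zero. A real caption video would need at least three frames and a crop to the caption region before a score like this is meaningful.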
To improve text stability, we propose a stability algorithm that aligns the old sequence of tokens with the new sequence of ASR predictions. The algorithm accounts for both natural language understanding and the ergonomics of the user experience: it performs tokenized alignment, semantic merging, and smooth animation to produce stabilized captions.
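The alignment and merging steps can be sketched roughly as follows. This is an illustrative stand-in, not the algorithm from the study: the `stabilize` and `_normalize` helpers are hypothetical, Python's `difflib.SequenceMatcher` is used in place of the study's alignment, and comparing tokens after lowercasing and stripping punctuation is a crude substitute for semantic merging. The animation step is not covered.

```python
from difflib import SequenceMatcher

def _normalize(token):
    # Crude stand-in for semantic comparison: ignore case and edge punctuation.
    return token.lower().strip(".,!?;:")

def stabilize(old_tokens, new_tokens):
    """Merge a new ASR hypothesis into the tokens already on screen,
    keeping the displayed token wherever the change is only cosmetic."""
    matcher = SequenceMatcher(
        a=[_normalize(t) for t in old_tokens],
        b=[_normalize(t) for t in new_tokens],
        autojunk=False,
    )
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same words up to case and punctuation: keep what is displayed.
            merged.extend(old_tokens[i1:i2])
        elif tag in ("replace", "insert"):
            # A real change or new words: take the ASR's latest tokens.
            merged.extend(new_tokens[j1:j2])
        # "delete": the ASR dropped these tokens, so they are omitted.
    return merged

# Only punctuation and casing changed, plus one appended word.
cosmetic = stabilize("hello world how are".split(),
                     "Hello, world. How are you".split())

# A genuine correction ("whether" -> "weather") is still applied.
corrected = stabilize("the whether is".split(),
                      "the weather is nice".split())
```

Here `cosmetic` keeps the four tokens already shown and only appends `you`, so the visible text does not change where the meaning did not. `corrected` shows that real word changes still go through.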
We conducted a user study with 123 participants to evaluate our proposed flicker metric and stabilization techniques. Participants watched different versions of live captions and rated their comfort, distraction, ease of reading, ease of following the video, fatigue, and overall experience. Our study found statistically significant correlations between the flicker metric and users’ ratings, indicating that text instability affects the user experience. Additionally, our stabilized captions with smooth animation received better ratings in terms of comfort, readability, and overall experience.
In conclusion, text stability is important for a better user experience with live captions. Our flicker metric and stabilization techniques provide objective measurements and solutions to address text instability. By implementing these improvements, we can enhance the accessibility and usability of live caption systems in various applications.