Automatic speech recognition (ASR) is widely used in applications such as conference calls, video transcription, and voice commands. A promising direction is audiovisual ASR (AV-ASR), which combines audio with visual information to improve recognition. In many real-world videos, however, the speaker's mouth is not clearly visible, so lip-based visual cues cannot be relied upon. This has motivated unconstrained AV-ASR, which instead explores the contribution of the entire visual frame.
Creating datasets for training AV-ASR models is difficult: existing datasets are small, and models trained on them tend to overfit. Audio-only models, by contrast, have been heavily optimized through large-scale training on massive audio corpora. In our research, we propose AVFormer, a method for augmenting large-scale audio-only models with visual information. We use lightweight, adaptable modules to inject visual embeddings into a frozen ASR model. These modules can be trained on a small amount of weakly labeled video data with minimal additional training time and parameters.
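The idea of injecting visual information into a frozen model can be illustrated with a toy numpy sketch. This is a minimal, hypothetical rendering of the mechanism described above, not AVFormer's actual architecture: a frozen "encoder" weight matrix stands in for the pretrained ASR model, a trainable projection maps visual features into the audio token space (so they can be prepended as extra tokens), and a small residual bottleneck adapter is the only other trainable piece. All dimensions and initializations here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_VISUAL, D_BOTTLENECK = 16, 32, 4  # toy sizes, not AVFormer's

# Frozen ASR encoder layer: these weights stay fixed during adaptation.
W_frozen = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)

# Trainable additions (the only new parameters in this sketch):
# 1) visual projection: maps visual features into the audio token space
W_proj = rng.standard_normal((D_VISUAL, D_MODEL)) / np.sqrt(D_VISUAL)
# 2) bottleneck adapter: down-project, ReLU, up-project, residual add.
W_down = rng.standard_normal((D_MODEL, D_BOTTLENECK)) / np.sqrt(D_MODEL)
W_up = np.zeros((D_BOTTLENECK, D_MODEL))  # zero-init: starts as identity

def adapter(x):
    # Residual bottleneck adapter; with W_up = 0 it is a no-op,
    # so the frozen model's behavior is preserved at initialization.
    return x + np.maximum(x @ W_down, 0.0) @ W_up

def encode(audio_tokens, visual_feats):
    # Project visual features and prepend them as extra tokens,
    # then run the frozen layer followed by the trainable adapter.
    visual_tokens = visual_feats @ W_proj
    tokens = np.concatenate([visual_tokens, audio_tokens], axis=0)
    return adapter(tokens @ W_frozen)

audio = rng.standard_normal((10, D_MODEL))   # 10 audio frames
visual = rng.standard_normal((4, D_VISUAL))  # 4 visual features
out = encode(audio, visual)
print(out.shape)  # 4 visual tokens prepended to 10 audio tokens
```

The zero-initialized adapter means the augmented model initially reproduces the frozen model's outputs exactly, which is one common way parameter-efficient modules avoid disrupting a pretrained network.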
To ensure the model processes both audio and visual information effectively, we introduce a curriculum scheme during training: the model learns the adapters (for domain adaptation) and the visual projection layers sequentially, and each phase is applied only once, which prevents the performance degradation we otherwise observe.
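The two-phase curriculum can be sketched as a simple freeze/unfreeze schedule. The parameter names and the dummy update rule below are purely illustrative (they are not from any released code); the point is only the control flow: the backbone stays frozen throughout, the adapters train first, the visual projection trains second, and each phase runs exactly once.

```python
import numpy as np

# Toy parameter store; names and shapes are illustrative only.
params = {
    "frozen_asr":  {"value": np.full(4, 0.5), "trainable": False},
    "adapter":     {"value": np.zeros(4),     "trainable": False},
    "visual_proj": {"value": np.zeros(4),     "trainable": False},
}

def run_phase(train_names, steps=3, lr=0.1):
    """Unfreeze only `train_names`, then take a few dummy update steps."""
    for name in params:
        params[name]["trainable"] = name in train_names
    for _ in range(steps):
        for p in params.values():
            if p["trainable"]:
                # Stand-in "gradient step": pull values toward 1.0.
                p["value"] -= lr * (p["value"] - 1.0)

# Curriculum: each phase runs exactly once, in order.
run_phase(["adapter"])      # phase 1: adapters for domain adaptation
run_phase(["visual_proj"])  # phase 2: visual projection layers

print(params["frozen_asr"]["value"])  # unchanged throughout
```

Training the adapters before the visual projection lets the model first settle into the target audio domain, so the later visual phase does not have to compensate for a domain mismatch at the same time.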
Our AVFormer model achieves state-of-the-art zero-shot performance on three AV-ASR benchmarks (How2, VisSpeech, and Ego4D) while maintaining decent performance on traditional audio-only benchmarks such as LibriSpeech. Compared with existing models, AVFormer requires significantly fewer parameters and smaller training datasets while achieving better performance.
In conclusion, AVFormer offers a practical and efficient way to adapt existing ASR models for AV-ASR, enabling both domain transfer and the mixing of visual inputs in a parameter-efficient manner.