The Importance of Device-Directed Speech Detection
Device-directed speech detection (DDSD) distinguishes queries aimed at a voice assistant from background conversation or noise. State-of-the-art DDSD systems classify speech using verbal cues such as acoustic, text, and automatic speech recognition (ASR) features. In real-world deployments, however, these systems often have to cope with missing modalities, where one or more of the expected inputs is unavailable.
Improving Robustness with Fusion Schemes
In this paper, we study fusion schemes that make DDSD systems more robust to missing modalities. We explore several approaches that combine the scores and embeddings from prosody (non-verbal cues) with those from the corresponding verbal cues. Our findings show that incorporating prosody features improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point.
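To make the distinction between combining scores and combining embeddings concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's architecture: the function names `late_fusion` and `intermediate_fusion`, the modality keys, the fixed weights, and the zero-filling of absent embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def late_fusion(scores, weights):
    """Score-level fusion: weighted average of per-modality scores.

    A modality whose score is None is treated as missing and skipped;
    the remaining weights are renormalized. (Hypothetical scheme.)
    """
    present = [m for m, s in scores.items() if s is not None]
    if not present:
        raise ValueError("at least one modality score is required")
    total = sum(weights[m] for m in present)
    return sum(weights[m] * scores[m] for m in present) / total


def intermediate_fusion(embeddings, dims, W, b):
    """Embedding-level fusion: concatenate embeddings, then apply one
    logistic layer. A missing embedding is replaced by zeros of the
    expected size (an assumption made for this sketch).
    """
    parts = [
        np.zeros(dims[m]) if embeddings.get(m) is None else embeddings[m]
        for m in dims
    ]
    x = np.concatenate(parts)
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))


# Example usage with made-up numbers.
weights = {"verbal": 0.5, "prosody": 0.5}
both = late_fusion({"verbal": 0.8, "prosody": 0.4}, weights)
verbal_only = late_fusion({"verbal": 0.8, "prosody": None}, weights)

dims = {"verbal": 3, "prosody": 2}
W, b = np.zeros(5), 0.0  # untrained placeholder parameters
p = intermediate_fusion({"verbal": np.ones(3), "prosody": None}, dims, W, b)
```

Score-level fusion degrades gracefully because a missing score is simply left out of the average, whereas embedding-level fusion needs a placeholder for the missing input so the fused vector keeps a fixed size.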
Enhancing Performance with Modality Dropout Techniques
We also investigate modality dropout techniques, which improve a model's ability to handle missing modalities at inference time. In our evaluation, applying modality dropout results in a 7.4% decrease in false acceptance rate (FA).
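The general idea of modality dropout can be sketched as follows: during training, whole modalities are randomly zeroed out per example so that the model learns not to depend on any single input. This is a generic illustration, not necessarily the variant evaluated in the paper; the function name `modality_dropout`, the `p_drop` parameter, and the rule that at least one modality is always kept are assumptions made for this example.

```python
import numpy as np


def modality_dropout(batch, p_drop, rng):
    """Randomly zero out whole modalities for each training example.

    batch:  dict mapping modality name -> array of shape (batch_size, dim)
    p_drop: probability of dropping each modality independently
    rng:    a numpy.random.Generator

    Returns the masked batch and the boolean keep-mask of shape
    (batch_size, num_modalities). If every modality of an example would
    be dropped, one randomly chosen modality is restored (assumption).
    """
    names = list(batch)
    n = next(iter(batch.values())).shape[0]
    keep = rng.random((n, len(names))) >= p_drop  # True means "keep"

    # Guarantee that each example retains at least one modality.
    all_dropped = ~keep.any(axis=1)
    rescue = rng.integers(len(names), size=n)
    keep[all_dropped, rescue[all_dropped]] = True

    masked = {m: batch[m] * keep[:, i:i + 1] for i, m in enumerate(names)}
    return masked, keep


# Example usage with made-up embeddings.
rng = np.random.default_rng(0)
batch = {"verbal": np.ones((4, 3)), "prosody": np.ones((4, 2))}
masked, keep = modality_dropout(batch, p_drop=0.5, rng=rng)
```

At inference time no dropout is applied; a modality that is genuinely missing is fed to the model the same way it was masked during training (zeros in this sketch), so the model has already seen that input pattern.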
Overall, our research shows that fusion schemes and modality dropout techniques improve the resilience of DDSD systems to missing modalities. By incorporating prosody features and applying modality dropout, we can achieve more accurate and reliable device-directed speech detection.