Voice Triggering at the HSCMA Workshop, ICASSP 2024
A recent paper accepted at the HSCMA workshop at ICASSP 2024 addresses voice triggering (VT), which lets users activate their devices by speaking a trigger phrase. A front-end system typically performs speech enhancement and/or separation and produces multiple enhanced and/or separated signals. Because conventional VT systems take only single-channel audio as input, one channel has to be selected, and the unselected channels are discarded even when they contain potentially useful information.
Multichannel Acoustic Models
The paper introduces multichannel acoustic models for VT, in which the multichannel output of the front-end is fed directly into the VT model. The model adopts a transform-average-concatenate (TAC) block, modified to incorporate the channel chosen by conventional channel selection, so that it can focus on the target speaker when multiple speakers are present. The paper reports a 30% reduction in the false rejection rate compared to the baseline channel selection method.
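To make the transform-average-concatenate idea more concrete, here is a minimal NumPy sketch of a TAC-style block. It is an illustration under stated assumptions, not the paper's implementation: the function names (tac_block, init_params), the layer sizes, the ReLU nonlinearity, and the residual connection are all hypothetical choices. In particular, the way the selected channel is injected (concatenating its transformed features onto every channel) is only one plausible reading of "incorporating channel information from conventional channel selection".

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def init_params(feat_dim, hidden_dim, use_selected, seed=0):
    """Random weights for the sketch; a real model would learn these."""
    rng = np.random.default_rng(seed)
    # Concatenation is 2 * hidden (own + average), or 3 * hidden when the
    # selected channel's features are appended as well (assumption).
    concat_dim = hidden_dim * (3 if use_selected else 2)
    return {
        "W_t": 0.1 * rng.standard_normal((feat_dim, hidden_dim)),
        "b_t": np.zeros(hidden_dim),
        "W_a": 0.1 * rng.standard_normal((hidden_dim, hidden_dim)),
        "b_a": np.zeros(hidden_dim),
        "W_c": 0.1 * rng.standard_normal((concat_dim, feat_dim)),
        "b_c": np.zeros(feat_dim),
    }


def tac_block(x, params, selected=None):
    """TAC-style block over x with shape (channels, frames, features).

    selected: optional index of the channel picked by a conventional
    channel selector (hypothetical way of injecting that information).
    Returns an array with the same shape as x.
    """
    # Transform: the same layer is applied to every channel independently.
    h = relu(x @ params["W_t"] + params["b_t"])               # (C, T, H)
    # Average: pool across channels, then transform the pooled features.
    g = relu(h.mean(axis=0) @ params["W_a"] + params["b_a"])  # (T, H)
    # Concatenate: each channel sees its own features plus the average.
    parts = [h, np.broadcast_to(g, h.shape)]
    if selected is not None:
        # Assumed modification: also append the selected channel's features.
        parts.append(np.broadcast_to(h[selected], h.shape))
    z = np.concatenate(parts, axis=-1)                        # (C, T, 2H or 3H)
    # Project back to the input size and add a residual connection.
    return x + relu(z @ params["W_c"] + params["b_c"])


# Example: 3 front-end output channels, 5 frames, 8 features per frame.
x = np.random.default_rng(1).standard_normal((3, 5, 8))
y_plain = tac_block(x, init_params(8, 16, use_selected=False))
y_sel = tac_block(x, init_params(8, 16, use_selected=True), selected=0)
```

Because the plain variant shares its weights across channels and pools with a mean, reordering the input channels simply reorders the output. Passing `selected` deliberately breaks that symmetry, which is how this sketch gives the model a hint about which channel carries the target speaker.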