Apple’s On-Device Voice Trigger System for Natural Voice Interactions
Consumer devices such as smart speakers, headphones, and watches increasingly use speech as their primary input, making voice trigger detection systems essential for controlling device access and interaction. Apple has designed a high-accuracy, privacy-focused, and power-efficient voice trigger system that enables seamless voice-driven interaction with its devices.
The voice trigger system supports various Apple device categories such as iPhone, iPad, HomePod, AirPods, Mac, Apple Watch, and Apple Vision Pro. These devices can detect two trigger phrases: “Hey Siri” and “Siri.”
Addressing Specific Challenges
Apple’s voice trigger system overcomes four key challenges:
1. Differentiating the primary user from other speakers
2. Rejecting false triggers from background noise
3. Discarding trigger-like acoustic segments
4. Supporting the shorter, more challenging trigger phrase (“Siri”) across different regions
Voice Trigger System Architecture
The voice trigger system operates in multiple stages. On mobile devices, incoming audio is continuously analyzed by the low-power Always On Processor (AOP): a ring buffer stores the streaming audio, which a high-recall voice trigger detector examines. Audio that contains no trigger keyword is discarded, while audio with a potential trigger is passed to a high-precision voice trigger checker on the Application Processor (AP). A speaker identification system then determines whether the trigger phrase was spoken by the device owner or by another user, and the Siri directed speech detection system analyzes the user’s full utterance, including the trigger phrase, to suppress false triggers.
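The gating logic of this cascade can be sketched as follows. This is a minimal illustration only: the class and function names, thresholds, and return values are hypothetical stand-ins, assuming each stage emits a score in [0, 1].

```python
# Illustrative sketch of the multi-stage voice trigger cascade.
# All names and thresholds here are hypothetical; in the real system the
# first stage runs on the Always On Processor (AOP) and the later,
# costlier stages run on the Application Processor (AP).

from collections import deque

class RingBuffer:
    """Fixed-capacity buffer holding the most recent audio frames."""
    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def push(self, frame):
        self.frames.append(frame)  # oldest frame is evicted when full

    def snapshot(self):
        return list(self.frames)

def run_pipeline(buffer, detector, checker, speaker_id, directed_speech):
    """Each stage is a callable returning a score in [0, 1]; later,
    more expensive stages run only if the earlier ones fire."""
    audio = buffer.snapshot()
    if detector(audio) < 0.3:         # high-recall first pass (AOP)
        return "discard"
    if checker(audio) < 0.7:          # high-precision second pass (AP)
        return "discard"
    if speaker_id(audio) < 0.5:       # spoken by the enrolled user?
        return "reject_speaker"
    if directed_speech(audio) < 0.5:  # utterance actually aimed at Siri?
        return "reject_not_directed"
    return "trigger"
```

The cascade ordering reflects the power budget described above: cheap, permissive checks run continuously, and each later stage only pays its cost on the small fraction of audio the previous stage lets through.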
Streaming Voice Trigger Detector
The first stage of the voice trigger detection system is a low-power deep neural network (DNN) acoustic model that predicts, for each audio frame, probabilities over a set of sound classes. A hidden Markov model (HMM) decoder combines these per-frame predictions into keyword detection scores. The DNN model comprises 23 states, covering phonemes, silence, and background noise. Weight palettization techniques compress the DNN model to reduce its computational and memory requirements.
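As a rough illustration of the DNN-HMM decoding step, the sketch below runs Viterbi search over per-frame DNN log-posteriors with a simple left-to-right HMM. The state count, transition probabilities, and function names are illustrative assumptions, not Apple's actual model.

```python
# Toy left-to-right HMM keyword scorer over DNN frame log-posteriors.
# Hypothetical sketch: real systems use many more states and tuned
# transition probabilities.
import math

def viterbi_keyword_score(log_posteriors, n_states,
                          self_loop=math.log(0.5), advance=math.log(0.5)):
    """Best-path log-score of traversing states 0..n_states-1 in order.

    log_posteriors: one list of n_states log-probabilities per frame,
    as produced by the acoustic DNN. A high final score means the frames
    plausibly spell out the keyword's states left to right.
    """
    NEG = float("-inf")
    # delta[s]: best log-score of paths ending in state s at this frame
    delta = [NEG] * n_states
    delta[0] = log_posteriors[0][0]  # paths must start in the first state
    for frame in log_posteriors[1:]:
        new = [NEG] * n_states
        for s in range(n_states):
            stay = delta[s] + self_loop
            move = (delta[s - 1] + advance) if s > 0 else NEG
            new[s] = max(stay, move) + frame[s]
        delta = new
    return delta[-1]  # keyword complete only if we reach the last state
```

Because the HMM is left-to-right, frames that activate the keyword's states in the wrong order score far lower than frames that activate them in sequence, which is what turns per-frame probabilities into a keyword-level decision.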
High-Precision Conformer-Based Voice Trigger Checker
If the first pass produces a detection, larger Conformer models re-score the candidate acoustic segments. The Conformer encoder combines self-attention and convolutional layers for accurate results. The model is trained to minimize a combination of connectionist temporal classification (CTC) loss and cross-entropy loss. Fine-tuning on trigger-phrase-specific data further improves the model’s ability to discriminate true triggers from acoustically similar segments.
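The CTC objective mentioned above scores a transcription by summing over all valid frame-level alignments. Below is a minimal forward-algorithm implementation plus a hypothetical combined CTC + cross-entropy objective; the weighting scheme and function names are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal CTC forward algorithm and a hypothetical combined objective.
import math

def ctc_forward_prob(probs, label, blank=0):
    """Total probability of `label` under per-frame class probabilities
    `probs` (T x C), summed over all CTC alignments."""
    ext = [blank]
    for c in label:
        ext += [c, blank]            # interleave blanks: length 2L + 1
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s > 0:
                a += alpha[t - 1][s - 1]
            # a blank may be skipped only between two *different* labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # valid paths end on the final label or the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

def checker_loss(probs, label, phrase_prob, is_trigger, lam=0.5):
    """Hypothetical combined objective: CTC negative log-likelihood of
    the phonetic label plus cross-entropy on a binary trigger decision."""
    ctc_nll = -math.log(ctc_forward_prob(probs, label))
    ce = -math.log(phrase_prob if is_trigger else 1.0 - phrase_prob)
    return lam * ctc_nll + (1.0 - lam) * ce
```

Training the checker with both terms encourages it to model the phrase's phonetic content (CTC) while also learning a sharp trigger/non-trigger decision (cross-entropy), which is what the fine-tuning step exploits.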
Personalized Voice Trigger System
Unintended activations in voice trigger systems can occur in scenarios like similar-sounding phrases or other users’ voices. To address this, Apple’s system includes personalized features that distinguish the primary user from others. This ensures that only the intended voice commands activate the device.
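One common way to implement such speaker discrimination, sketched here under the assumption that a speaker-embedding model is available, is to compare an embedding of the candidate utterance against embeddings enrolled by the primary user. The threshold and function names below are hypothetical.

```python
# Hypothetical speaker-verification sketch: accept the trigger only if
# the utterance embedding matches an enrolled primary-user embedding.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_primary_user(utterance_emb, enrolled_embs, threshold=0.6):
    """Compare against every enrollment utterance's embedding and accept
    if the best similarity clears the (illustrative) threshold."""
    best = max(cosine_similarity(utterance_emb, e) for e in enrolled_embs)
    return best >= threshold
```

Keeping several enrollment embeddings per user (rather than one average) makes the check more robust to variation in distance, background noise, and vocal effort across the user's own utterances.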
In conclusion, Apple’s on-device voice trigger system brings high accuracy, privacy, and power efficiency to enable natural voice interactions with their devices. With advanced algorithms and personalized features, Apple continues to enhance the user experience and make voice control more seamless.