Significant Advances in AI Zero-Shot Voice Conversion
Researchers from Northwestern Polytechnical University in China and ByteDance have introduced StreamVoice, a new streaming language model (LM)-based method for zero-shot voice conversion (VC). StreamVoice enables real-time conversion given any speaker prompt and source speech by combining a fully causal, context-aware LM with a temporal-independent acoustic predictor.
At each autoregressive time step, StreamVoice processes semantic and acoustic features together, so conversion can begin before the complete source speech is available. To preserve quality under this causal constraint, the model is trained with teacher-guided context foresight and a semantic masking strategy. StreamVoice is the first LM-based streaming zero-shot VC model that requires no future look-ahead, and it has been shown to maintain zero-shot performance comparable to non-streaming VC systems.
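The causal, look-ahead-free design described above can be illustrated with a minimal sketch. Note this is a hypothetical illustration, not the authors' code: the function names (`lm_step`, `predict_acoustic`) and the token representations are assumptions made for clarity.

```python
# Hypothetical sketch of a fully causal streaming VC loop (illustration
# only, not StreamVoice's actual implementation): at every step the LM
# conditions only on past context plus the speaker prompt, so no future
# look-ahead is needed and output can start before the source ends.

def stream_convert(semantic_chunks, speaker_prompt, lm_step, predict_acoustic):
    """Convert source speech chunk by chunk.

    semantic_chunks  -- iterator of semantic-token chunks from a streaming front end
    speaker_prompt   -- target-speaker tokens, fixed before streaming starts
    lm_step          -- causal LM step: (history, prompt, chunk) -> hidden state
    predict_acoustic -- temporal-independent predictor: hidden -> acoustic tokens
    """
    history = []
    for chunk in semantic_chunks:
        hidden = lm_step(history, speaker_prompt, chunk)  # sees past only
        acoustic = predict_acoustic(hidden)               # no look-ahead
        history.extend(chunk)                             # interleave semantic
        history.extend(acoustic)                          # ... and acoustic context
        yield acoustic  # acoustic tokens would be vocoded to waveform downstream
```

The key property the sketch captures is that each emitted chunk depends only on the prompt and previously seen tokens, which is what makes streaming output possible.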
The research demonstrates that StreamVoice can convert speech in a streaming fashion, achieving high speaker similarity for both seen and unseen speakers. The model's conversion latency is only 124 ms, and it runs 2.4 times faster than real time on a single A100 GPU, even without engineering optimizations. The team plans to further optimize the streaming pipeline, adopt a high-fidelity low-bitrate codec, and build a unified streaming model.
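The speed figures above can be sanity-checked with the standard real-time-factor (RTF) definition; the numbers come from the article, while the formula itself is the conventional one.

```python
# Back-of-the-envelope check of the reported speed claim using the
# standard real-time-factor definition (RTF < 1 means faster than
# real time). Values are from the article; nothing here is measured.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of audio processed."""
    return processing_seconds / audio_seconds

# "2.4x faster than real time" corresponds to an RTF of 1 / 2.4:
rtf = 1 / 2.4          # ~0.417: one second of audio takes ~0.42 s to convert
speedup = 1 / rtf      # recovers the reported 2.4x figure
```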
For more details, see the paper.