Researchers at NVIDIA have proposed Incremental FastPitch, a variant of the parallel text-to-speech (TTS) model FastPitch designed for faster, more controllable real-time speech synthesis. The model produces high-quality Mel chunks with low latency, making it well suited to real-time and streaming applications. It is trained with constrained receptive fields, exploring both static and dynamic chunk attention masks, and generates Mel-spectrograms with almost no observable difference from those of parallel FastPitch.
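To make the chunk-mask idea concrete, here is a minimal sketch of a static chunk attention mask in plain Python. It assumes a simple scheme in which each query frame may attend to every frame in its own chunk plus a fixed number of past chunks; the function name and parameters (`chunk_size`, `past_chunks`) are illustrative, not the paper's actual implementation.

```python
def chunk_attention_mask(n_frames, chunk_size, past_chunks):
    """Build a binary self-attention mask with a constrained receptive field.

    mask[i][j] == 1 means query frame i may attend to key frame j.
    Each frame sees its own chunk and up to `past_chunks` preceding chunks,
    so the receptive field stays fixed regardless of utterance length.
    """
    mask = [[0] * n_frames for _ in range(n_frames)]
    for i in range(n_frames):
        ci = i // chunk_size                          # chunk index of query frame
        lo = max(0, (ci - past_chunks) * chunk_size)  # earliest visible key frame
        hi = min((ci + 1) * chunk_size, n_frames)     # end of current chunk (exclusive)
        for j in range(lo, hi):
            mask[i][j] = 1
    return mask
```

With `chunk_size=2` and `past_chunks=1`, a frame in chunk 2 can attend to chunks 1 and 2 but not chunk 0, which is what bounds the attention cost per chunk during streaming.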
The model is trained and evaluated on the Chinese Standard Mandarin Speech Corpus, producing speech quality comparable to parallel FastPitch at significantly lower latency. It incorporates chunk-based FFT blocks, training with receptive field-constrained chunk attention masks, and inference with fixed-size past model states, all of which contribute to the improved performance. These results highlight the model's effectiveness and its promise for real-time and streaming applications; the full paper provides further details.
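The fixed-size past states mentioned above can be sketched as a rolling cache: once the cache is full, the oldest frames are dropped so per-chunk inference cost stays constant no matter how long the stream runs. This is an assumed, simplified illustration using Python's `collections.deque`; the class and method names are hypothetical, not from the paper's code.

```python
from collections import deque

class FixedSizePastState:
    """Rolling cache of past per-frame states for streaming inference.

    Capped at `max_past` frames: appending beyond the cap silently evicts
    the oldest frames, keeping attention context (and compute) bounded.
    """
    def __init__(self, max_past):
        self.frames = deque(maxlen=max_past)

    def append_chunk(self, chunk):
        # chunk: list of per-frame states produced for the latest Mel chunk
        self.frames.extend(chunk)

    def context(self):
        # Past states the next chunk is allowed to attend to
        return list(self.frames)
```

In this sketch, after three two-frame chunks with `max_past=4`, only the last four frames remain visible, mirroring the bounded receptive field used at training time.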