Introducing VITS2: Enhancing Naturalness and Efficiency in Text-to-Speech Synthesis

The paper introduces VITS2, a single-stage text-to-speech model that synthesizes more natural speech. It tackles the shortcomings of earlier work, namely intermittent unnaturalness, low computational efficiency, and a strong dependence on phoneme conversion, and improves naturalness, speaker similarity in multi-speaker models, and training and inference efficiency.

Previous Methods

Two-Stage Pipeline Systems: These systems divide the generation of waveforms from input text into two stages: the first produces intermediate speech representations, typically mel-spectrograms, and the second generates raw waveforms from them. Such systems suffer from error propagation between the stages and from their reliance on human-defined intermediate features.

Single-Stage Models: Recent studies have explored single-stage models that generate waveforms directly from input text. These models have outperformed two-stage systems and can produce high-quality speech.

VITS ("Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech") was a significant prior work in single-stage synthesis and is the model that VITS2 builds on. Despite its quality, it exhibited intermittent unnaturalness, low efficiency, a complex input format, and a strong dependence on phoneme conversion.

The current paper addresses the issues of this previous single-stage model and proposes improvements in four areas:

  • Duration prediction: A stochastic duration predictor is proposed, trained through adversarial learning.
  • Augmented variational autoencoder with normalizing flows: A transformer block is introduced into the normalizing flows so the flow can capture long-term dependencies (a minimal sketch follows this list).
  • Alignment search: Monotonic Alignment Search (MAS) is retained for alignment, with modifications that improve quality.
  • Speaker-conditioned text encoder: The text encoder is conditioned on the speaker so that it can mimic the speech characteristics of each speaker.
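
To make the flow change concrete, below is a minimal PyTorch sketch of an additive coupling layer whose shift is computed by a small transformer block, so the flow can attend across the whole time axis rather than only a local convolutional window. Everything here (the class name TransformerCouplingLayer, the layer sizes, and the (batch, time, channels) layout) is an illustrative assumption, not the authors' code.

```python
# A minimal sketch of an additive coupling layer whose shift is produced
# by a transformer block. Hypothetical example, not the VITS2 implementation.

import torch
import torch.nn as nn

class TransformerCouplingLayer(nn.Module):
    def __init__(self, channels: int, n_heads: int = 2, ff_dim: int = 256):
        super().__init__()
        assert channels % 2 == 0, "channels must split evenly into two halves"
        self.half = channels // 2
        # Self-attention over the whole sequence lets the flow model
        # long-term dependencies, unlike a purely convolutional block.
        block = nn.TransformerEncoderLayer(
            d_model=self.half, nhead=n_heads, dim_feedforward=ff_dim,
            dropout=0.0, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=1)
        self.proj = nn.Linear(self.half, self.half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels). Transform one half of the channels
        # conditioned on the other half, which keeps the layer invertible.
        x_a, x_b = x[..., :self.half], x[..., self.half:]
        shift = self.proj(self.encoder(x_a))
        return torch.cat([x_a, x_b + shift], dim=-1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        # Exact inverse: recompute the shift from the untouched half.
        y_a, y_b = y[..., :self.half], y[..., self.half:]
        shift = self.proj(self.encoder(y_a))
        return torch.cat([y_a, y_b - shift], dim=-1)

# Round-trip check on a dummy latent sequence.
layer = TransformerCouplingLayer(channels=192).eval()
z = torch.randn(1, 50, 192)
assert torch.allclose(layer.inverse(layer(z)), z, atol=1e-5)
```

Because the coupling is additive (mean-only), its Jacobian log-determinant is zero, so swapping the convolutional parameter network for a transformer does not complicate the flow's likelihood term.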

The proposed methods were evaluated on single- and multi-speaker datasets and showed significant improvements in the quality of synthesized speech. The authors acknowledge that other problems in speech synthesis remain and hope their work will serve as a basis for future research.


Check out the paper and GitHub. All credit for this research goes to the researchers on this project.
