Neural text-to-speech (TTS) can produce speech that sounds almost human, as long as enough high-quality recordings are available for training. However, collecting that data is expensive and time-consuming, especially if you want a voice to cover several speaking styles.
But here’s the good news: researchers have found a way to transfer speaking styles between speakers and improve the quality of synthetic speech. They did this by training a multi-speaker multi-style (MSMS) model on long-form recordings in addition to regular TTS recordings.
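To make the idea concrete, here is a minimal sketch of how pooled training data could be labeled so that a style can be requested for a speaker who never recorded it. It assumes each utterance carries a separate speaker label and style label. The summary above does not say how the actual model is conditioned, so the class, function, and label names below are hypothetical and purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    text: str
    speaker: str  # hypothetical speaker label
    style: str    # hypothetical style label, e.g. "neutral" or "long_form"


def build_label_maps(utterances):
    """Assign an integer ID to every speaker and style in the pooled data."""
    speakers = sorted({u.speaker for u in utterances})
    styles = sorted({u.style for u in utterances})
    speaker_ids = {name: i for i, name in enumerate(speakers)}
    style_ids = {name: i for i, name in enumerate(styles)}
    return speaker_ids, style_ids


def conditioning(speaker, style, speaker_ids, style_ids):
    """Return the (speaker ID, style ID) pair a model could be conditioned on."""
    return speaker_ids[speaker], style_ids[style]


# Pooled training set: regular TTS recordings plus long-form recordings.
train = [
    Utterance("Hello there.", "target_speaker", "neutral"),
    Utterance("Chapter one.", "narrator_a", "long_form"),
    Utterance("Good morning.", "narrator_a", "neutral"),
]

speaker_ids, style_ids = build_label_maps(train)
seen_pairs = {(u.speaker, u.style) for u in train}

# Ask for the long-form style in the target speaker's voice.
request = ("target_speaker", "long_form")
is_transfer = request not in seen_pairs  # True: this pair was never recorded
cond = conditioning(*request, speaker_ids, style_ids)
```

Keeping speaker and style as independent labels is what lets the request combine them freely. Whether the resulting speech sounds good is exactly the kind of question the study set out to evaluate.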
Their study uncovered three important findings:
1. Multi-speaker modeling improves overall TTS quality.
2. When extra multi-speaker data is available, the MSMS approach outperforms pre-training followed by fine-tuning.
3. People rate the long-form speaking style highly, no matter what the target text is about.
By combining the MSMS model with long-form recordings, the researchers improved synthetic speech quality and made speaking styles transferable between speakers. In the future, this could mean even higher-quality synthetic speech that mimics natural human speaking styles, without the need for extensive and expensive data collection.
So, the next time you hear a voice that sounds almost human, it might just be artificial intelligence at work, powered by MSMS modeling and a long-form speaking style.