Voice technology has become ubiquitous. However, recognition accuracy varies considerably across languages, which makes the technology less inclusive. The amount of training data available for a language is a key factor in that accuracy, especially when training all-neural end-to-end automatic speech recognition (ASR) systems.
Two techniques have proven successful in improving the accuracy of ASR systems, particularly for low-resource languages such as Ukrainian: cross-lingual knowledge transfer and iterative pseudo-labeling.
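The iterative pseudo-labeling idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described here: `iterative_pseudo_labeling`, `seed_model`, `train_fn`, and `rounds` are hypothetical names, a model is represented as a plain function from an utterance to a transcript, and the seed model stands in for whatever starting point is available (for example an existing recognizer, or a model initialized from other languages).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: an utterance is any object the model can consume,
# and a model is simply a callable that returns a transcript string.
Utterance = str
Transcript = str
Model = Callable[[Utterance], Transcript]
LabeledPair = Tuple[Utterance, Transcript]


def iterative_pseudo_labeling(
    seed_model: Model,
    unlabeled: Sequence[Utterance],
    train_fn: Callable[[List[LabeledPair], Model], Model],
    rounds: int = 3,
) -> Model:
    """Repeatedly relabel the unlabeled audio with the current model,
    then train a new model on those machine-generated transcripts."""
    model = seed_model
    for _ in range(rounds):
        # Step 1: the current model transcribes every unlabeled utterance.
        pseudo_labeled = [(utt, model(utt)) for utt in unlabeled]
        # Step 2: a new model is trained on the pseudo-labels. The previous
        # model is passed in so train_fn may warm-start from it if desired.
        model = train_fn(pseudo_labeled, model)
    return model
```

No manually annotated transcripts enter the loop: every label is produced by a model. Practical systems often also filter or weight pseudo-labels by confidence, which this sketch leaves out.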
Our goal is to train an all-neural, Transducer-based ASR system to replace a DNN-HMM hybrid system without using any manually annotated training data. Our experiments show that a Transducer system trained on transcripts produced by the hybrid system reduces word error rate by 18%. Combining cross-lingual knowledge transfer with iterative pseudo-labeling increases that reduction to 35%.
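For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows one common way to compute it, along with a reduction between two systems. The function names are hypothetical, and the text above does not say whether the quoted reductions are relative or absolute, so the relative form used here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Corpus-level WER is usually obtained by summing the edit distances and the reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.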