This article presents an approach for training Context-Aware Transformer Transducer (CATT) models that improves training by mining negative phrases from the context encoder’s latent space.
During training, given a reference query, we mine a set of similar phrases using an approximate nearest neighbor search. These phrases are then added to the context list as negative examples, alongside random and ground truth contextual information. Including these approximate nearest neighbor phrases (ANN-P) encourages the model to distinguish between biasing phrases that are similar but not identical, which improves biasing accuracy when the context list contains several similar phrases.
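The mining step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it uses exact brute-force cosine search as a stand-in for an approximate nearest neighbor index, and the toy 2-D embeddings, the phrase strings, and the helper names `mine_negatives` and `build_context_list` are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_negatives(query, embeddings, k):
    """Return the k phrases closest to `query`, excluding the query itself.

    Exact brute-force search, used here in place of a real ANN index.
    """
    q = embeddings[query]
    scored = [
        (cosine(q, vec), phrase)
        for phrase, vec in embeddings.items()
        if phrase != query
    ]
    scored.sort(reverse=True)  # most similar first
    return [phrase for _, phrase in scored[:k]]


def build_context_list(ground_truth, embeddings, num_ann, num_random, rng):
    """Combine the ground truth phrase, mined negatives, and random negatives."""
    ann = mine_negatives(ground_truth, embeddings, num_ann)
    # Random negatives are drawn from phrases not already selected.
    pool = [p for p in embeddings if p != ground_truth and p not in ann]
    rand = rng.sample(pool, min(num_random, len(pool)))
    context = [ground_truth] + ann + rand
    rng.shuffle(context)  # avoid leaking the position of the ground truth
    return context, ann


# Toy embeddings (hypothetical values, not produced by a trained context encoder).
EMBEDDINGS = {
    "call john smith": [1.0, 0.0],
    "call john smyth": [0.98, 0.2],
    "call joan smith": [0.9, 0.4],
    "play jazz music": [0.0, 1.0],
    "set a timer": [-1.0, 0.1],
    "open the garage": [-0.5, -0.8],
}

context, ann = build_context_list(
    "call john smith", EMBEDDINGS, num_ann=2, num_random=2, rng=random.Random(0)
)
print("mined negatives:", ann)
print("context list:", context)
```

In this toy setup the two mined negatives are the near-duplicates of the reference phrase, which is the kind of confusable example that random sampling alone would rarely surface. In practice the embeddings would come from the context encoder and the search would run against an ANN index over a much larger phrase inventory.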
In experiments conducted in a large-scale data regime, we observed up to a 7% relative reduction in word error rate on the contextual portion of the test data. We also extended the CATT approach to streaming applications and evaluated it in that setting.
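To make the "relative" part of that figure concrete, the short sketch below computes a relative word error rate reduction from two WER values. The helper name `relative_wer_reduction` and the input numbers are purely illustrative and are not results from the experiments.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which WER drops relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative numbers only: a drop from 10.0% to 9.3% WER is
# 0.7 points absolute, which is a 7% relative reduction.
reduction = relative_wer_reduction(10.0, 9.3)
print(f"{reduction:.1%}")
```

A relative reduction is measured against the baseline error rate, so the same relative figure corresponds to a smaller absolute change when the baseline WER is already low.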
The method targets a specific weakness of contextual biasing: confusion among phrases that are similar but not identical. By adding such phrases to the context list as negative examples during training, the model learns to handle context lists that contain both diverse and closely related phrases. The same idea could carry over to other applications that rely on contextual information, such as natural language understanding and generation.
Overall, this extension to the CATT model uses approximate nearest neighbor search to mine negative phrases during training, which improves the accuracy with which the model handles contextual information.