Natural language processing (NLP) has been dominated by large-scale Transformers, which have shown impressive abilities across a wide range of applications. Similar pre-training methods have proven successful in speech processing. OpenAI's Whisper is a collection of multilingual, multitask speech models trained on carefully curated speech data gathered from various online sources.
However, the complete model-building process is not available to the public. This poses several issues:
1. Using pre-trained models on new benchmarks without knowledge of the training data can lead to data leakage.
2. Researchers have difficulty understanding the underlying mechanisms and improving the model’s performance without access to the training dynamics.
3. Dealing with problems like robustness, fairness, bias, and toxicity becomes more challenging without access to the entire model development pipeline.
To promote open science, a research team from Carnegie Mellon University, Shanghai Jiao Tong University, and Honda Research Institute created the Open Whisper-style Speech Model (OWSM). OWSM reproduces Whisper-style training using an open-source toolkit and publicly available data, and it adopts the Whisper framework for crucial tasks such as language identification (LID) and automatic speech recognition (ASR).
OWSM also introduces technical innovations. It supports any-to-any speech translation rather than only any-to-English translation, and it applies several strategies to improve training efficiency. The entire pipeline, including data preparation, training, inference, and scoring, will be covered by reproducible recipes, and the team plans to release pre-trained models and training logs so that researchers can gain insight into the training procedure.
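Whisper conditions its decoder on a short sequence of special tokens that specify the source language and the task; supporting any-to-any translation additionally requires encoding the target language. The sketch below illustrates the idea in plain Python. The token names and the `build_prompt` helper are hypothetical, chosen to mirror Whisper's published format, and are not OWSM's actual vocabulary.

```python
from typing import List, Optional


def build_prompt(source_lang: str, task: str,
                 target_lang: Optional[str] = None) -> List[str]:
    """Build a Whisper-style decoder prompt as a list of special tokens.

    Whisper prefixes decoding with tokens like
    <|startoftranscript|> <|de|> <|transcribe|>. For any-to-any
    translation, a target-language token must also be supplied,
    since <|translate|> alone only implies English output.
    """
    tokens = ["<|startoftranscript|>", f"<|{source_lang}|>"]
    if task == "transcribe":
        tokens.append("<|transcribe|>")
    elif task == "translate":
        if target_lang is None:
            raise ValueError("translation requires a target language")
        # Original Whisper supports only X -> English; an any-to-any
        # model must make the target language explicit.
        tokens.append("<|translate|>")
        tokens.append(f"<|{target_lang}|>")
    else:
        raise ValueError(f"unknown task: {task}")
    return tokens


# ASR in German:
print(build_prompt("de", "transcribe"))
# German -> French translation (not possible with original Whisper,
# but expressible in an any-to-any setup):
print(build_prompt("de", "translate", "fr"))
```

The design point is that multitask behavior is controlled entirely by the prompt prefix, so adding a task or language pair is a vocabulary and data question rather than an architecture change.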
While OWSM performs comparably to or better than Whisper on some metrics, it is not intended to compete with Whisper: OWSM's training dataset is only about 25% the size of Whisper's, and resource constraints limited the number of trial runs.
In the future, the team plans to improve OWSM by using more sophisticated encoder or decoder architectures and gathering more diverse data. They also aim to add other speech-processing tasks to create “universal speech models.”
You can check out the full paper and code for more information. Credit goes to the researchers on this project.