The AI community is being reshaped by large language models (LLMs) such as ChatGPT and GPT-4. These models have advanced natural language processing, allowing them to read, write, and converse at a near-human level. However, their success in audio processing (including speech, music, sound, and talking heads) remains limited. This is a notable gap, because humans primarily communicate through spoken language and increasingly rely on spoken assistants for convenience.
Training LLMs to support audio processing is challenging for two main reasons. First, data is scarce: there are few sources of real-world spoken conversations, obtaining labeled speech data is expensive and time-consuming, and multilingual conversational speech is rare compared to the abundance of web text. Second, training multi-modal LLMs from scratch requires significant computational resources and time.
To address these challenges, researchers from Zhejiang University, Peking University, Carnegie Mellon University, and Renmin University of China have developed “AudioGPT,” a system that understands and produces audio in spoken dialogues without training multi-modal LLMs from scratch. Instead, it leverages a variety of existing audio foundation models and connects LLMs to input/output interfaces for speech conversations.
The AudioGPT process consists of four stages: modality transformation, task analysis, model assignment, and response generation. Modality transformation converts speech to text (and back) using input/output interfaces bridging spoken language and the LLM. Task analysis uses the conversation engine and prompt manager in ChatGPT to infer user intent from the request. Model assignment selects the appropriate audio foundation models for comprehension or generation, passing structured arguments that control prosody, timbre, and language. Finally, response generation runs the assigned audio foundation models and returns the final answer to the user.
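The four stages above can be sketched as a simple dispatch pipeline. The following is a minimal illustrative sketch, not AudioGPT's actual code: every function, task name, and model name here is a hypothetical placeholder (the real system wires ChatGPT to actual audio foundation models).

```python
# Hypothetical sketch of AudioGPT's four-stage flow.
# All names below (tasks, models, functions) are illustrative placeholders.

def modality_transform(user_input: dict) -> str:
    """Stage 1: convert speech input to text; a real system would call an ASR model."""
    if user_input["type"] == "speech":
        return f"[transcribed] {user_input['content']}"
    return user_input["content"]

def analyze_task(text: str) -> str:
    """Stage 2: infer user intent (stubbed here with naive keyword matching;
    AudioGPT uses ChatGPT's conversation engine and prompt manager)."""
    if "music" in text or "sing" in text:
        return "text-to-music"
    if "speak" in text or "say" in text:
        return "text-to-speech"
    return "audio-understanding"

# Stage 3: a registry mapping tasks to (placeholder) audio foundation models.
MODEL_REGISTRY = {
    "text-to-speech": "tts-model-placeholder",
    "text-to-music": "music-model-placeholder",
    "audio-understanding": "asr-model-placeholder",
}

def generate_response(model: str, text: str) -> str:
    """Stage 4: run the assigned model and return the final answer (stubbed)."""
    return f"{model} handled request: {text}"

def audiogpt_pipeline(user_input: dict) -> str:
    text = modality_transform(user_input)          # 1. modality transformation
    task = analyze_task(text)                      # 2. task analysis
    model = MODEL_REGISTRY[task]                   # 3. model assignment
    return generate_response(model, text)          # 4. response generation

print(audiogpt_pipeline({"type": "text", "content": "please speak this sentence"}))
```

The key design choice this sketch mirrors is that the LLM never processes audio directly; it only routes requests to specialist models, which is why no multi-modal training from scratch is needed.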
AudioGPT has been shown to handle complex audio data in multi-round dialogues across a range of AI applications, including speech, music, sound, and talking-head generation. The researchers also outline design principles and an evaluation process for measuring the system's consistency, capability, and robustness. This work equips ChatGPT with audio foundation models for sophisticated audio tasks and enables spoken communication through a modality transformation interface.
Overall, AudioGPT makes it easy to produce rich and diverse audio content. It offers a practical answer to the challenges of training LLMs for audio processing, and its open-source code is available on GitHub for further exploration.