
Bark: The New Text2Speech Model for Voice Cloning and Audio Replication

Bark, a new text-to-speech model, has been released. It ships with constraints on voice cloning and a restricted set of prompts to protect user safety. However, researchers have managed to decode the audio representations, remove those constraints, and publish the result in a Jupyter notebook. As a consequence, an entire voice can now be cloned from just 5-10 seconds of paired audio and text samples.

What is Bark?

Bark is an innovative text-to-audio model developed by Suno. It is built on GPT-style transformer models and can generate natural-sounding speech in multiple languages, as well as music, background noise, and simple sound effects. In addition to speech and music, Bark can also produce nonverbal vocal cues such as laughing, sighing, and sobbing.

How does it work?

Bark uses GPT-style models to generate audio from scratch, similar to other impressive work such as Vall-E. However, instead of phonemes, it conditions on high-level semantic tokens derived from the text prompt. This allows the model to generalize beyond speech to non-speech sounds, such as music lyrics or sound effects. A second model then converts the semantic tokens into audio codec tokens, from which the full waveform is produced.
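The two-stage idea can be sketched in a few lines. Everything below is an illustrative stand-in, not Bark's real API: text is first mapped to a sequence of discrete semantic tokens, and a second stage expands those into audio codec tokens that a neural codec would decode into a waveform.

```python
# Minimal sketch of a two-stage text-to-audio token pipeline.
# Both functions are hypothetical stand-ins for real transformer models.

def text_to_semantic_tokens(text):
    # Stand-in: a real model would run a GPT-style transformer over the
    # text and emit a sequence of discrete semantic token IDs.
    return [hash(word) % 10_000 for word in text.split()]

def semantic_to_codec_tokens(semantic_tokens, tokens_per_step=2):
    # Stand-in: a second model expands each semantic token into several
    # audio codec tokens (the representation a codec decoder turns into
    # the final waveform).
    return [t % 1024 for t in semantic_tokens for _ in range(tokens_per_step)]

semantic = text_to_semantic_tokens("hello from bark")
codec = semantic_to_codec_tokens(semantic)
print(len(semantic), len(codec))  # codec sequence is longer than semantic
```

The key design point is that the intermediate representation is semantic rather than phonetic, which is what lets the same pipeline emit music or sound effects as easily as speech.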


– Bark supports multiple languages and can automatically detect the input language. English currently has the highest quality, and other languages are expected to improve as the model is scaled up. Bark also adjusts the accent to match the language when presented with code-switched text.

– Bark can generate any type of sound, including music. The model does not differentiate between speech and music, and it will sometimes even set the input text to music.

– The model can reproduce every aspect of a human voice, including timbre, pitch, inflection, and prosody. It is also capable of preserving environmental sounds, music, and other elements of the input. Thanks to Bark’s automatic language recognition, it can take prompts in a different language and produce audio with the corresponding accent.

– Users can steer the voice toward a specific character by providing directions such as NARRATOR, MAN, or WOMAN in the prompt. However, these directions may not always be honored, particularly when they conflict with an audio history prompt.
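To make the speaker-direction feature concrete, here is a small sketch. The `split_speaker_direction` helper is hypothetical (it is not part of Bark); it simply pulls an optional leading direction such as NARRATOR, MAN, or WOMAN out of a prompt so it can be inspected before generation. With the `bark` package installed, the full prompt would then be passed to the model, as shown in the trailing comments.

```python
import re

# Hypothetical helper (not part of Bark's API): extract an optional
# leading speaker direction from a text prompt.
def split_speaker_direction(prompt):
    match = re.match(r"^(NARRATOR|MAN|WOMAN):\s*(.*)$", prompt)
    if match:
        return match.group(1), match.group(2)
    return None, prompt

direction, text = split_speaker_direction("WOMAN: Hello, and welcome to the show.")
print(direction, "->", text)

# With the bark package installed, generation might look like:
#   from bark import generate_audio
#   audio = generate_audio(f"{direction}: {text}")
```

As the article notes, such directions are hints rather than guarantees: a conflicting audio history prompt can override them.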


Bark has been validated on both CPU and GPU (PyTorch 2.0+, CUDA 11.7, and CUDA 12.0). It can generate audio in near real time on current GPUs using PyTorch. However, note that running transformer models with over a hundred million parameters will be noticeably slower on older GPUs, the default Colab runtime, or a CPU.
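Because inference speed depends so heavily on the hardware, a quick environment check before loading the model can be useful. The heuristic below is an assumption for illustration, not part of Bark itself: it simply falls back to CPU when no NVIDIA driver tooling is visible on the system.

```python
import shutil

# Illustrative device pick (an assumption, not Bark's own logic):
# prefer CUDA when the NVIDIA driver tooling is on PATH, else CPU.
device = "cuda" if shutil.which("nvidia-smi") else "cpu"
print("running Bark inference on:", device)
```

In a real PyTorch setup, `torch.cuda.is_available()` is the more direct check.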

To learn more about Bark, you can check out the Repo and Blog. Don’t forget to join our ML SubReddit, Discord Channel, and Email Newsletter for the latest AI research news and cool AI projects. If you have any questions or if we missed anything, feel free to email us at Asif@marktechpost.com.

Check out AI Tools Club for hundreds of AI tools.
