Meta, the parent company of Facebook, recently unveiled Voicebox, an artificial intelligence (AI) tool that generates speech from text. Voicebox can create high-quality audio clips and edit pre-recorded audio while preserving the original content and style.
The multilingual tool synthesizes speech in six languages and uses machine learning to remove noise from recordings. It can also replace misspoken words without requiring the whole speech to be re-recorded. This generative text-to-speech model joins other prominent generative AI systems such as ChatGPT and DALL-E.
Meta introduced Voicebox to the public in a blog post last week. The generative AI model is designed for speech generation tasks such as editing, sampling, and stylizing.
It can produce audio clips from an audio sample as short as two seconds while retaining the original content and style of the audio. Voicebox can also edit pre-recorded audio.
The text-to-speech model offers a range of capabilities, including noise removal, content editing, style conversion, and diverse sample generation. It can modify specific segments of an input sample and recreate portions of speech interrupted by external noises such as car horns or barking dogs. It can also replace misspoken words without requiring the entire speech to be re-recorded.
Voicebox can synthesize speech in six languages: English, French, Spanish, German, Polish, and Portuguese. It can produce a reading of a text in any of these languages, even when the input speech sample and the provided text are in different languages.
According to Meta AI's research paper, Voicebox outperforms Microsoft's VALL-E and generates audio samples up to 20 times faster. "Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models," Meta AI said.
According to Meta's blog, Voicebox generates speech that closely resembles how people naturally converse in the real world in all six supported languages. Meta envisions using this capability to generate synthetic data, which could improve the training of speech assistant models in the future.
Voicebox is still in development and is not available to the general public. Meta acknowledges that the technology, like other AI advances, could be misused or have unintended negative consequences.
To address these concerns, Meta is working on a classifier that can reliably distinguish authentic speech from audio generated with Voicebox. The measure is intended to mitigate the technology's potential risks.