Microsoft Previews Audio Cloning AI System


Microsoft previewed VALL-E, a text-to-speech audio cloning system that can copy a person’s voice from three seconds of audio.

Microsoft has previewed an audio cloning system, named VALL-E, that can mimic a person’s voice from as little as three seconds of that person’s speech.

VALL-E uses text-to-speech technology to convert written words into speech, and has been trained on 60,000 hours of English speech with 7,000 unique voices from LibriLight, a public-domain audiobook dataset.

Microsoft previewed the technology on the academic site arXiv, but has not made it freely available for the public to try. The company has also not confirmed whether it plans to make VALL-E public, or what its intended purpose is.

In the preview, Microsoft notes that VALL-E lacks a diversity of accents, as it was trained primarily on native English speakers, and that synthesized words are often unclear or missed. However, the researchers also say it “significantly outperforms” the most advanced systems available today.

Even though VALL-E is a nod to DALL-E, OpenAI’s generative image system, the two are not related and OpenAI is not a lead research partner on VALL-E. 

Ethical issues abound with a technology like this, as we already see with less advanced text-to-speech systems that have synthesized celebrities’ and politicians’ voices. Microsoft researchers discussed some of them in the preview:

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.”

How this protocol would be integrated in practice is not known, and it looks like Microsoft has not yet fully thought through the ways scammers could use short voice clips to defraud people over the telephone and online. That said, not making VALL-E public is some assurance that the company understands the risks involved.

Microsoft is considered by AI experts and insiders to be behind Google and Meta Platforms in AI sophistication and scope. However, its investment in OpenAI and exclusive licensing of the underlying GPT-3 model have provided it with a potential way back into the lead.

It announced this week a “multi-billion dollar” investment in OpenAI, bringing the two companies closer together and potentially giving Microsoft first dibs on GPT-4, which is expected to be another major leap forward in the sophistication of foundational models.

Generative systems look to be the AI story of 2023, with ChatGPT exploding in popularity and DALL-E being the catalyst for the launch of hundreds of generative image editor apps, some of which have reached the top of the US iPhone and Android app store charts.

David Curry

About David Curry

David is a technology writer with several years’ experience covering all aspects of IoT, from technology to networks to security.
