OpenAI Tests New Voice Clone Model
The ChatGPT parent company is testing dramatically improved text-to-speech technology with select developers. Experts warn that realistic voice generation raises serious safety questions.
OpenAI on Friday announced a pilot program for its new custom voice text-to-speech (TTS) offering, called Voice Engine, that will allow users to create realistic speech from text using just a short audio sample.
In a blog post, the ChatGPT maker says it's currently working with developers to test the newest model in its application programming interface (API), which can take a single 15-second sample of audio to create natural-sounding speech closely matching the original speech. Those developers agreed to a strict usage policy, which prohibits the impersonation of another individual or organization without consent or legal right. Partners must also require explicit informed consent from the original speaker.
In a live demo with InformationWeek, OpenAI Product Lead Jeff Harris showed how a quick live recording of his voice could be used to create a text-to-speech sample that was indistinguishable from his real voice. The whole process took just moments.
The speed and realism of OpenAI’s custom voice TTS will likely be an attractive prospect for many commercial and consumer uses, but it also presents serious risks and challenges. The potential for misuse is profound.
That’s why OpenAI is testing the software first with a select group of developers.
Safety First
AI voice cloning is a serious concern for AI ethics, especially in an election year. US President Joe Biden called for a ban on AI voice impersonations in his State of the Union address on March 7. In January, an AI-generated imitation of Biden’s voice was used in a robocall scam that urged New Hampshire primary voters to “save their votes” for the November presidential election.
In February, the Federal Communications Commission (FCC) made AI-generated voices in robocalls illegal under the Telephone Consumer Protection Act.
OpenAI, for its part, says it is moving ahead with its voice cloning model carefully. OpenAI’s blog post called for a broad effort that would phase out voice-based authentication now used widely as a security measure.
“We’re going to start with a limited set of developers and people that we have trusted relationships with and ask them to agree to a pretty comprehensive set of terms that includes things like permission from every speaker whose voice is used and making sure that any generated speech is clearly labeled as AI-generated,” Harris tells InformationWeek. Harris says OpenAI has also developed a “watermarking” system that allows identification of voice recordings generated with its model.
Responsible AI Institute founder Manoj Saxena thinks the pilot program is the right approach, but says more guardrails are needed as AI technology continues to rapidly develop. With hyper-realistic voice generation, a criminal could trick family members into scams or worse. And with an election cycle coming up, concerns about deepfakes used to spread misinformation are growing.
“This is a massive, dual-edged sword,” Saxena tells InformationWeek in a phone interview. “This could be another nail in the coffin for truth and data privacy. This adds yet more of an unknown dynamic where you could have something that can create a lot of emotional distress and psychological effects. But I can also see a lot of positives. It all depends on how it gets regulated.”
Saxena hopes OpenAI includes regulators and safety advocates in the pilot process as well.
Voice Cloning Could Impact Business, Workers
OpenAI’s enterprise-grade version of ChatGPT was released in August 2023. An entry-level tier targeting small and medium-sized businesses soon followed. A voice clone feature offering speed and a low barrier to entry could create massive demand from businesses, especially in the customer service sector. According to Statista, there are more than 2.8 million contact center employees in the US alone.
Max Ball, a principal analyst at Forrester, says voice cloning software already exists, but the efficiency of OpenAI’s model could be a game-changer. “It’s a pretty strong step in a couple ways,” Ball tells InformationWeek in an interview. “Today, from what the vendors are showing me, you can do a custom voice, but it takes 15-20 minutes of voice to be able to train it. While 15 minutes doesn’t sound like a lot of time, it’s tough to get anyone to sit down for 15 minutes during a day of work.”
For the call center market, the speed and quality of custom voice will very likely lead to a massive shift in labor needs. “The change we’re going to see there is that it’s going to automate those jobs,” Ball says. “And the job of an agent, the agents that are left, it’s going to be a more challenging job -- but a much more valuable job.”