Not content merely to disrupt text generation, imagery, and video with its various AI models, ChatGPT-maker OpenAI is also getting into the last major form of legacy digital media: audio. Specifically, voice cloning.
The company today is announcing its newest AI model, “Voice Engine,” which it says has been in development since 2022 and currently powers OpenAI’s text-to-speech API and the new ChatGPT Voice and Read Aloud features unveiled earlier this month.
As it turns out, the model can also perform voice cloning. Here’s how it works: a human speaker records a 15-second clip of their voice through a phone or computer microphone, and OpenAI’s Voice Engine generates “natural-sounding speech that closely resembles the original speaker,” which can then be used to speak aloud any text that a human user types in.
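Voice Engine’s cloning capability is not publicly accessible, but the text-to-speech API the company says it powers is. For readers who want a sense of the workflow, here is a minimal sketch of that documented speech endpoint using one of OpenAI’s preset voices; it assumes the official openai Python package and an OPENAI_API_KEY environment variable, and the input text and output file name are illustrative only, not the cloning process itself.

```python
# Minimal sketch of OpenAI's text-to-speech API, which the company says Voice Engine powers.
# Voice cloning itself is not publicly exposed; "alloy" is one of the built-in preset voices.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.audio.speech.create(
    model="tts-1",    # text-to-speech model
    voice="alloy",    # preset voice (cloned voices are limited to trusted partners)
    input="Hello! This sentence will be read aloud in a natural-sounding voice.",
)

# Write the returned audio bytes to an MP3 file (illustrative file name)
with open("speech.mp3", "wb") as f:
    f.write(response.content)
```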
Enormous implications for spoken audio market
The tech obviously has huge implications for those who record themselves speaking often, be they podcasters, voice-over artists, spoken word performers, audiobook and advertising narrators, gamers, streamers, customer service agents, salespersons, or people in many other occupations and disciplines.
It also puts pressure on other companies dedicated to this type of tech, such as the well-funded AI startup ElevenLabs, as well as Captions, Meta, WellSaid Labs, MyShell, and others.
OpenAI further highlights Voice Engine’s capability to support non-verbal individuals, providing them with unique, non-robotic voices, and to aid therapeutic and educational programs for those with speech impairments or learning needs.
Initial use cases
OpenAI said in its blog post announcing Voice Engine today that so far, it has only made the tech available to a “small group of trusted partners.” Among those highlighted and named are:
- Age of Learning, an education technology company that uses Voice Engine and GPT-4 for generating pre-scripted and real-time personalized voice content, expanding reading assistance and interactivity for a diverse student audience.
- HeyGen, an AI visual storytelling platform that enables creators and businesses to translate their content into multiple languages, employs Voice Engine for video translation, creating custom human-like avatars with multilingual voices and preserving the original speaker’s accent to reach a global audience.
- Dimagi, a software company making tools for community health workers, uses Voice Engine and GPT-4 to provide interactive feedback in various languages for said workers, improving essential service delivery in remote settings.
- Livox, an AI app for Augmentative and Alternative Communication (AAC) devices used by those with speech and hearing difficulties, integrates Voice Engine to provide unique, non-robotic voices across languages for non-verbal individuals.
- The Norman Prince Neurosciences Institute at Lifespan, a nonprofit medical and teaching organization at Brown University dedicated to helping those with neurological diseases and disorders, is using Voice Engine to give patients with speech impairments an AI version of their own voice. Two doctors there, Rohaid Ali and pediatric neurosurgeon Konstantina Svokos, have already successfully restored a brain tumor patient’s speech using an audio sample from one of her school project videos.
The company uploaded to its blog, and emailed to VentureBeat under embargo, several audio samples showing the tech’s humanlike speaking capabilities. For example, here’s the original “source voice” of Lifespan’s patient:
And here’s the cloned voice using OpenAI Voice Engine:
Limited user base by design
Yet for now, the tech is limited. As with its powerful, incredibly realistic and vivid video generation AI model Sora, OpenAI is not presently allowing the public to use Voice Engine. Instead, today OpenAI is simply sharing the existence of the tool and “preliminary insights and results from a small-scale preview” with “a small group of trusted partners” who have been given access.
As OpenAI states in its blog post today announcing the tech:
“We are taking a cautious and informed approach to a broader release due to the potential for synthetic voice misuse. We hope to start a dialogue on the responsible deployment of synthetic voices and how society can adapt to these new capabilities. Based on these conversations and the results of these small scale tests, we will make a more informed decision about whether and how to deploy this technology at scale.”
The cautious, slow-and-steady, limited access approach to releasing Voice Engine makes sense especially in light of U.S. President Joseph R. Biden’s recent call to “ban AI voice impersonation.”
Central to OpenAI’s deployment strategy is a stringent adherence to safety and ethical guidelines. Partners involved in testing Voice Engine are bound by usage policies that prohibit unauthorized impersonation and require informed consent from voice donors.
Additionally, OpenAI has implemented safety measures such as watermarking and proactive monitoring to ensure the technology’s responsible use.