OpenAI upgrades its transcription and voice-generating AI models

OpenAI is bringing new transcription and voice-generating AI models to its API that the company claims improve upon its previous releases.

For OpenAI, the models fit into its broader “agentic” vision: building automated systems that can independently accomplish tasks on behalf of users. The definition of “agent” might be in dispute, but OpenAI Head of Product Olivier Godement described one interpretation as a chatbot that can speak with a business’s customers.

“We’re going to see more and more agents pop up in the coming months,” Godement told Trendster during a briefing. “And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate.”

OpenAI claims that its new text-to-speech model, “gpt-4o-mini-tts,” not only delivers more nuanced and realistic-sounding speech but is also more “steerable” than its previous-gen speech-synthesizing models. Developers can instruct gpt-4o-mini-tts on how to say things in natural language, for example, “speak like a mad scientist” or “use a serene voice, like a mindfulness teacher.”

Here’s a “true crime-style,” weathered voice:

And here’s a sample of a female “professional” voice:

Jeff Harris, a member of the product staff at OpenAI, told Trendster that the goal is to let developers tailor both the voice “experience” and “context.”

“In different contexts, you don’t just want a flat, monotonous voice,” Harris said. “If you’re in a customer support experience and you want the voice to be apologetic because it’s made a mistake, you can actually have the voice have that emotion in it … Our big belief, here, is that developers and users want to really control not just what is spoken, but how things are spoken.”
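To make the split between what is spoken and how it is spoken concrete, here is a minimal sketch of a request body for such a model. The article confirms only the model name and the idea of natural-language style instructions. The field names (`input`, `voice`, `instructions`) and the `alloy` voice are assumptions based on OpenAI's public speech endpoint, and `build_speech_request` is a hypothetical helper. Nothing is sent over the network.

```python
import json


def build_speech_request(text: str, style: str, voice: str = "alloy") -> dict:
    """Assemble a JSON body for a text-to-speech request (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "model": "gpt-4o-mini-tts",  # model name taken from the article
        "input": text,               # what is spoken (assumed field name)
        "voice": voice,              # base voice (assumed field name and value)
        "instructions": style,       # how it is spoken (assumed field name)
    }


body = build_speech_request(
    "I'm sorry, I gave you the wrong order number.",
    "use a serene voice, like a mindfulness teacher",
)
print(json.dumps(body, indent=2))
```

Keeping the style in its own field means the same text can be re-rendered as apologetic, upbeat, or neutral without editing the words themselves.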

As for OpenAI’s new speech-to-text models, “gpt-4o-transcribe” and “gpt-4o-mini-transcribe,” they effectively replace the company’s long-in-the-tooth Whisper transcription model. Trained on “diverse, high-quality audio datasets,” the new models can better capture accented and varied speech, OpenAI claims, even in chaotic environments.

They’re also less likely to hallucinate, Harris added. Whisper notoriously tended to fabricate words, and even whole passages, in conversations, introducing everything from racial commentary to imagined medical treatments into transcripts.

“[T]hese models are much improved versus Whisper on that front,” Harris said. “Making sure the models are accurate is completely essential to getting a reliable voice experience, and accurate [in this context] means that the models are hearing the words precisely [and] aren’t filling in details that they didn’t hear.”

Your mileage may vary depending on the language being transcribed, however.

According to OpenAI’s internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a “word error rate” approaching 30% (out of 120%) for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. That means roughly three out of every 10 words from the model will differ from a human transcription in these languages.
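For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the human reference, divided by the number of words in the reference. The sketch below implements that textbook definition with a standard edit-distance table. The article does not describe how OpenAI normalizes or tokenizes text for its benchmarks, so the whitespace splitting and the example sentences here are assumptions for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: 3 of 10 reference words are transcribed differently.
reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown box jumps over a lazy dog tonight"
print(word_error_rate(reference, hypothesis))  # 0.3
```

Because inserted words count as errors while the denominator is fixed by the reference length, the ratio is not capped at 100%: a transcript that adds many words the speaker never said can score above 1.0.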

The results from OpenAI transcription benchmarking. Image Credits: OpenAI

In a break from tradition, OpenAI doesn’t plan to make its new transcription models openly available. The company historically released new versions of Whisper for commercial use under an MIT license.

Harris said that gpt-4o-transcribe and gpt-4o-mini-transcribe are “much bigger than Whisper” and thus not good candidates for an open release.

“[T]hey’re not the kind of model that you can just run locally on your laptop, like Whisper,” he continued. “[W]e want to make sure that if we’re releasing things in open source, we’re doing it thoughtfully, and we have a model that’s really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-source models.”

Updated March 20, 2025, 11:54 a.m. PT to clarify the language around word error rate and to update the benchmark results chart with a more recent version.