Deepgram has made a name for itself as one of the go-to startups for voice recognition. Today, the well-funded company announced the launch of Aura, its new real-time text-to-speech API. Aura combines highly realistic voice models with a low-latency API to let developers build real-time, conversational AI agents. Backed by large language models (LLMs), these agents can then stand in for customer service agents in call centers and other customer-facing situations.
As Deepgram co-founder and CEO Scott Stephenson told me, it has long been possible to get access to great voice models, but those were expensive and took a long time to compute. Low-latency models, meanwhile, tend to sound robotic. Deepgram's Aura combines human-like voice models that render extremely fast (typically in well under half a second) and, as Stephenson noted repeatedly, does so at a low price.
"Everybody now is like: 'hey, we need real-time voice AI bots that can understand what's being said and that can understand and generate a response, and then they can speak back,'" he said. In his view, it takes a combination of accuracy (which he described as table stakes for a service like this), low latency and acceptable costs to make a product like this worthwhile for businesses, especially when combined with the relatively high cost of accessing LLMs.
Deepgram argues that Aura's pricing currently beats virtually all of its competitors at $0.015 per 1,000 characters. That's not all that far off Google's pricing for its WaveNet voices at $0.016 per 1,000 characters or Amazon Polly's Neural voices at the same $0.016 per 1,000 characters, but, granted, it is cheaper. Amazon's highest tier, though, is significantly more expensive.
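To put those per-character rates in perspective, here is a minimal back-of-the-envelope sketch in Python that estimates what a given synthesis volume would cost at each of the list prices quoted above. The one-million-character workload is an arbitrary illustration, not a figure from any of the vendors.

```python
# Rough cost comparison using only the list prices quoted in this article.
# The 1,000,000-character workload is an arbitrary example, not vendor data.

PRICE_PER_1K_CHARS = {
    "Deepgram Aura": 0.015,
    "Google WaveNet": 0.016,
    "Amazon Polly Neural": 0.016,
}

def tts_cost(characters: int, price_per_1k: float) -> float:
    """Estimated cost in USD to synthesize `characters` characters of text."""
    return characters / 1_000 * price_per_1k

if __name__ == "__main__":
    workload = 1_000_000  # e.g. a hypothetical month of short call-center replies
    for provider, price in PRICE_PER_1K_CHARS.items():
        print(f"{provider}: ${tts_cost(workload, price):.2f} for {workload:,} characters")
```

At that volume the gap works out to $15.00 versus $16.00, which is small in absolute terms but compounds for the high-traffic call-center use cases Deepgram is targeting.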
"You have to hit a really good price point across all [segments], but then you also have to have amazing latencies, speed, and then amazing accuracy as well. So it's a really hard thing to hit," Stephenson said about Deepgram's general approach to building its product. "But this is what we focused on from the beginning, and this is why we built for four years before we launched anything, because we were building the underlying infrastructure to make that real."
Aura offers around a dozen voice models at this point, all of which were trained on a dataset Deepgram created together with voice actors. The Aura model, just like all of the company's other models, was trained in-house. Here's what it sounds like:
You can try a demo of Aura here. I've been testing it for a bit and, although you'll occasionally come across some odd pronunciations, the speed is really what stands out, in addition to Deepgram's existing high-quality speech-to-text model. To highlight how quickly it generates responses, Deepgram notes the time it took the model to start speaking (often less than 0.3 seconds) and how long it took the LLM to finish generating its response (typically just under a second).