Two undergrads built an AI speech model to rival NotebookLM

A pair of undergrads, neither with extensive AI experience, say that they’ve created an openly available AI model that can generate podcast-style clips similar to Google’s NotebookLM.

The market for synthetic speech tools is vast and growing. ElevenLabs is one of the largest players, but there’s no shortage of challengers (see PlayAI, Sesame, and so on). Investors believe that these tools have immense potential. According to PitchBook, startups developing voice AI tech raised over $398 million in VC funding last year.

Toby Kim, one of the Korea-based co-founders of Nari Labs, the group behind the newly released model, said that he and his fellow co-founder started learning about speech AI three months ago. Inspired by NotebookLM, they wanted to create a model that offered more control over generated voices and “freedom in the script.”

Kim says they used Google’s TPU Research Cloud program, which provides researchers with free access to the company’s TPU AI chips, to train Nari’s model, Dia. Weighing in at 1.6 billion parameters, Dia can generate dialogue from a script, letting users customize speakers’ tones and insert disfluencies, coughs, laughs, and other nonverbal cues.

Parameters are the internal variables models use to make predictions. Generally, models with more parameters perform better.

Available from the AI dev platform Hugging Face and GitHub, Dia can run on most modern PCs with at least 10GB of VRAM. It generates a random voice unless prompted with a description of an intended style, but it can also clone a person’s voice.
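
For a rough sense of how a 1.6-billion-parameter model relates to that 10GB figure, the memory needed for the weights alone can be estimated by multiplying the parameter count by the bytes used per parameter. The sketch below is only back-of-the-envelope arithmetic; the numeric precisions listed are illustrative assumptions, not settings Nari has documented for Dia, and the function name is hypothetical.

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# The precisions below are illustrative assumptions, not Dia's documented settings.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


DIA_PARAMS = 1_600_000_000  # 1.6 billion parameters, as reported in the article

if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(DIA_PARAMS, dtype):.1f} GB")
```

Under these assumptions, the weights would take about 6.4GB at 32-bit precision and about 3.2GB at 16-bit. Running a model also requires memory for intermediate computations, so the real footprint is larger than the weights alone.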

In Trendster’s brief testing of Dia through Nari’s web demo, Dia worked quite well, uncomplainingly generating two-way chats about any subject. The quality of the voices seems competitive with other tools out there, and the voice cloning function is among the easiest this reporter has tried.

Like many voice generators, however, Dia offers little in the way of safeguards. It’d be trivially easy to craft disinformation or a scammy recording. On Dia’s project pages, Nari discourages abuse of the model to impersonate, deceive, or otherwise engage in illicit campaigns, but the group says it “isn’t responsible” for misuse.

Nari also hasn’t disclosed which data it scraped to train Dia. It’s possible Dia was developed using copyrighted content; a commenter on Hacker News notes that one sample sounds like the hosts of NPR’s “Planet Money” podcast. Training models on copyrighted content is a widespread but legally dubious practice. Some AI companies claim that fair use shields them from liability, while rights holders assert that fair use doesn’t apply to training.

In any event, Kim says Nari’s plan is to create a synthetic voice platform with a “social aspect” on top of Dia and larger, future models. Nari also intends to release a technical report for Dia, and to expand the model’s support to languages beyond English.