MLCommons and Hugging Face team up to release massive speech dataset for AI research

MLCommons, a nonprofit AI safety working group, has teamed up with AI dev platform Hugging Face to release one of the world’s largest collections of public domain voice recordings for AI research.

The dataset, called Unsupervised People’s Speech, contains more than a million hours of audio spanning at least 89 languages. MLCommons says it was motivated to create it by a desire to support R&D in “various areas of speech technology.”

“Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,” the organization wrote in a blog post Thursday. “We anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis.”

It’s an admirable goal, to be sure. But AI datasets like Unsupervised People’s Speech can carry risks for the researchers who choose to use them.

Biased data is one of those risks. The recordings in Unsupervised People’s Speech came from Archive.org, the nonprofit perhaps best known for the Wayback Machine web archival tool. Because many of Archive.org’s contributors are English-speaking, and American, almost all of the recordings in Unsupervised People’s Speech are in American-accented English, per the readme on the official project page.

That means that, without careful filtering, AI systems like speech recognition and voice synthesizer models trained on Unsupervised People’s Speech could exhibit some of the same prejudices. They might, for example, struggle to transcribe English spoken by a non-native speaker, or have trouble generating synthetic voices in languages other than English.

Unsupervised People’s Speech might also contain recordings from people unaware that their voices are being used for AI research purposes, including commercial applications. While MLCommons says that all recordings in the dataset are public domain or available under Creative Commons licenses, there’s the possibility that mistakes were made.

According to an MIT analysis, hundreds of publicly available AI training datasets lack licensing information and contain errors. Creator advocates including Ed Newton-Rex, the CEO of AI ethics-focused nonprofit Fairly Trained, have made the case that creators shouldn’t be required to “opt out” of AI datasets because of the onerous burden opting out imposes on these creators.

“Many creators (e.g. Squarespace users) have no meaningful way of opting out,” Newton-Rex wrote in a post on X last June. “For creators who can opt out, there are multiple overlapping opt-out methods, which are (1) incredibly confusing and (2) woefully incomplete in their coverage. Even if a perfect universal opt-out existed, it would be hugely unfair to put the opt-out burden on creators, given that generative AI uses their work to compete with them; many would simply not realize they could opt out.”

MLCommons says that it’s committed to updating, maintaining, and improving the quality of Unsupervised People’s Speech. But given the potential flaws, it’d behoove developers to exercise serious caution.