EleutherAI releases massive AI training dataset of licensed and open domain text

EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, that EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web (including copyrighted material like books and research journals) to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the group says has harmed the broader AI research field by making it harder to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the group released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

EleutherAI is committing to releasing open datasets more frequently going forward in collaboration with its research and infrastructure partners.

Updated 9:48 a.m. Pacific: Biderman clarified in a post on X that EleutherAI contributed to the release of the datasets and models, but that their development involved many partners, including the University of Toronto, which helped lead the research.