OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

bicycledays

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next-best model managed to answer only around 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.

Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, larger compute tiers can be expected to achieve better benchmark scores.

OpenAI’s own Wedna Zhou, a member of the technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed versus the version of o3 demoed in December. As a result, it may exhibit benchmark “disparities,” he added.

“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful,” Zhou said. “We still hope that, we still think that, this is a much better model.”

Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.

Updated 4:21 p.m. Pacific: Added comments from Wedna Zhou, a member of the OpenAI technical staff.
