Did xAI lie about Grok 3’s benchmarks?


Debates over AI benchmarks — and the way they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI’s co-founders, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What’s cons@64, you might ask? It’s short for “consensus@64,” and it essentially gives a model 64 tries to answer each problem in a benchmark, then takes the answers it generates most frequently as its final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph can make it appear as though one model surpasses another when in reality that isn’t the case.
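To make that concrete, here is a minimal Python sketch of how consensus scoring differs from grading a model’s first attempt. The function names and the toy data are hypothetical, and a real cons@64 run would sample 64 answers per problem rather than the four used here.

```python
from collections import Counter

def score_at_1(samples, answer_key):
    """@1: grade only the first answer the model produced per problem."""
    correct = sum(1 for pid in answer_key if samples[pid][0] == answer_key[pid])
    return correct / len(answer_key)

def score_cons_at_k(samples, answer_key, k=64):
    """cons@k: majority-vote over k sampled answers per problem, then grade."""
    correct = 0
    for pid, reference in answer_key.items():
        # Take the most frequent answer among the first k samples.
        consensus, _ = Counter(samples[pid][:k]).most_common(1)[0]
        if consensus == reference:
            correct += 1
    return correct / len(answer_key)

# Hypothetical toy data: three AIME-style problems with four sampled
# answers each.
samples = {
    "p1": ["204", "204", "210", "204"],  # first try right, majority right
    "p2": ["36", "45", "45", "45"],      # first try wrong, majority right
    "p3": ["12", "7", "7", "7"],         # first try wrong, majority right
}
answer_key = {"p1": "204", "p2": "45", "p3": "7"}

print(score_at_1(samples, answer_key))            # 0.33...
print(score_cons_at_k(samples, answer_key, k=4))  # 1.0
```

On this toy data the first-attempt score is 1/3 while the consensus score is 3/3, which is exactly the kind of gap that makes omitting cons@64 from a comparison chart contentious.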

Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (that is, the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64.

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations, and about their strengths.
