Meta exec denies the company artificially boosted Llama 4’s benchmark scores

A Meta exec on Monday denied a rumor that the company trained its new AI models to perform well on specific benchmarks while concealing the models’ weaknesses.

The executive, Ahmad Al-Dahle, VP of generative AI at Meta, said in a post on X that it’s “simply not true” that Meta trained its Llama 4 Maverick and Llama 4 Scout models on “test sets.” In AI benchmarks, test sets are collections of data used to evaluate the performance of a model after it’s been trained. Training on a test set could misleadingly inflate a model’s benchmark scores, making the model appear more capable than it actually is.

Over the weekend, an unsubstantiated rumor that Meta artificially boosted its new models’ benchmark results began circulating on X and Reddit. The rumor appears to have originated from a post on a Chinese social media site from a user claiming to have resigned from Meta in protest over the company’s benchmarking practices.

Reports that Maverick and Scout perform poorly on certain tasks fueled the rumor, as did Meta’s decision to use an experimental, unreleased version of Maverick to achieve better scores on the benchmark LM Arena. Researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena.

Al-Dahle acknowledged that some users are seeing “mixed quality” from Maverick and Scout across the different cloud providers hosting the models.

“Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in,” Al-Dahle said. “We’ll keep working through our bug fixes and onboarding partners.”