Which AI agent is the best? This new leaderboard can tell you


What's better than an AI chatbot that can perform tasks for you when prompted? AI that can do tasks for you on its own.

AI agents are the latest frontier in the AI space. AI companies are racing to build their own models, and offerings are constantly rolling out to enterprises. But which AI agent is the best?

Galileo Leaderboard

On Wednesday, Galileo launched an Agent Leaderboard on Hugging Face, an open-source AI platform where users can build, train, access, and deploy AI models. The leaderboard is meant to help people learn how AI agents perform in real-world business applications and help teams determine which agent best fits their needs.

On the leaderboard, you will find details about a model's performance, including its rank and score. At a glance, you can also see more basic information about the model, including vendor, cost, and whether it is open source or private.

The leaderboard currently features "the 17 leading LLMs," including models from Google, OpenAI, Mistral, Anthropic, and Meta. It is updated monthly to keep up with ongoing releases, which have been arriving frequently.

How models are ranked

To determine the results, Galileo uses benchmarking datasets, including BFCL (Berkeley Function Calling Leaderboard), τ-bench (Tau benchmark), xLAM, and ToolACE, which test different agent capabilities. The leaderboard then turns this data into an evaluation framework that covers real-world use cases.

"BFCL excels in academic domains like mathematics, entertainment, and education, τ-bench specializes in retail and airline scenarios, xLAM covers data generation across 21 domains, and ToolACE focuses on API interactions in 390 domains," the company explains in a blog post.

Galileo adds that each model is stress-tested to measure everything from simple API calls to more advanced tasks such as multi-tool interactions. The company also shared its methodology, reassuring users that it applies a standardized method to evaluate all AI agents fairly. The post includes a more technical dive into the model scoring.
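To make the idea of tool-call evaluation concrete, here is a minimal sketch of how a single test case might be scored. This is not Galileo's actual scoring code; the function name, the expected-call format, and the exact-match criterion are all assumptions for illustration.

```python
import json

def score_tool_call(model_output: str, expected: dict) -> float:
    """Score one test case: did the model pick the right tool
    with the right arguments? (Hypothetical exact-match criterion.)"""
    try:
        call = json.loads(model_output)  # model emits a JSON tool call
    except json.JSONDecodeError:
        return 0.0  # malformed output scores zero
    right_tool = call.get("name") == expected["name"]
    right_args = call.get("arguments") == expected["arguments"]
    return 1.0 if right_tool and right_args else 0.0

# Example test case in the style of function-calling benchmarks
expected = {"name": "get_flight_status",
            "arguments": {"flight": "UA123", "date": "2025-02-19"}}
output = ('{"name": "get_flight_status", '
          '"arguments": {"flight": "UA123", "date": "2025-02-19"}}')
print(score_tool_call(output, expected))  # 1.0
```

A real harness would average scores like this across many cases per capability (tool selection, long context, multi-tool chains) to produce the per-category numbers shown on the leaderboard.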

The rankings

Google's Gemini-2.0 Flash is in first place, followed closely by OpenAI's GPT-4o. Both of these models received what Galileo calls "Elite Tier Performance" status, which is given to models with a score of 0.9 or higher. Google and OpenAI dominated the leaderboard with their private models, taking the first six positions.

Google's Gemini 2.0 was consistent across all of the evaluation categories and balanced that performance with cost-effectiveness, according to the post, at a price of $0.15/$0.60 per million tokens. Although GPT-4o was a close second, it has a much higher price point at $2.50/$10 per million tokens.
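That per-token gap compounds quickly at agent scale. The short sketch below works out the difference for a hypothetical monthly workload; it assumes the quoted figures are input/output prices per million tokens, which is the usual convention but is not spelled out in the post.

```python
# Hypothetical workload: 50M input tokens and 10M output tokens per month
INPUT_TOKENS = 50_000_000
OUTPUT_TOKENS = 10_000_000

# Quoted prices per million tokens, assumed to be (input, output)
PRICING = {
    "gemini-2.0-flash": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

for model, (in_price, out_price) in PRICING.items():
    cost = (INPUT_TOKENS / 1e6) * in_price + (OUTPUT_TOKENS / 1e6) * out_price
    print(f"{model}: ${cost:,.2f}/month")

# gemini-2.0-flash: $13.50/month
# gpt-4o: $225.00/month -- roughly 17x more for the same workload
```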

In the "high-performance segment," the category below the elite tier, Gemini-1.5-Flash came in third place and Gemini-1.5-Pro in fourth. OpenAI's reasoning models, o1 and o3-mini, followed in fifth and sixth place, respectively.

Mistral-small-2501 was the first open-source AI model to chart. Its score of 0.832 placed it in the "mid-tier capabilities" category. The evaluations found its strengths to be strong long-context handling and tool selection capabilities.

How to access

To view the results, you can visit the Agent Leaderboard on Hugging Face. In addition to the standard leaderboard, you will be able to filter the results by whether the LLM is open source or private, and by category, which refers to the capability being tested (overall, long context, composite, etc.).
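If you prefer to work with the results programmatically rather than in the browser, something like the following could pull and filter them, assuming the results are also published as a Hugging Face dataset. The dataset ID ("galileo-ai/agent-leaderboard") and the column names here are assumptions, not confirmed identifiers; check the leaderboard page for the real ones.

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical dataset ID -- verify on the leaderboard page
ds = load_dataset("galileo-ai/agent-leaderboard", split="train")
df = ds.to_pandas()

# Hypothetical column names: "model", "score", "category", "license"
open_source = df[df["license"] == "open-source"]
long_context = df[df["category"] == "long context"]

# Top open-source agents by score
print(open_source.sort_values("score", ascending=False).head())
```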
