Benchmarks illustrate the models’ capabilities in areas like coding and reasoning. The table below shows each model’s performance across various domains, based on available data for agentic coding, math, reasoning, and tool use.
| Benchmark | Claude 4 Opus | Claude 4 Sonnet | GPT-4o | Gemini 2.5 Pro |
|---|---|---|---|---|
| HumanEval (Code Gen) | Not Available | Not Available | 74.8% | 75.6% |
| GPQA (Graduate Reasoning) | 83.3% | 83.8% | 83.3% | 83.0% |
| MMLU (World Knowledge) | 88.8% | 86.5% | 88.7% | 88.6% |
| AIME 2025 (Math) | 90.0% | 85.0% | 88.9% | 83.0% |
| SWE-bench (Agentic Coding) | 72.5% | 72.7% | 69.1% | 63.2% |
| TAU-bench (Tool Use) | 81.4% | 80.5% | 70.4% | Not Available |
| Terminal-bench (Coding) | 43.2% | 35.5% | 30.2% | 25.3% |
| MMMU (Visual Reasoning) | 76.5% | 74.4% | 82.9% | 79.6% |
Based on these results, Claude 4 generally excels in coding, GPT-4o in reasoning, and Gemini 2.5 Pro offers strong, balanced performance across different modalities.
Overall Assessment
Here’s what we’ve learned about these leading models, based on the above points of comparison:
- Claude 4 excels in coding, math, and tool use, but it is also the most expensive of the three.
- GPT-4o excels at reasoning and multimodal support, handling different input formats, which makes it a great choice for more advanced and complex assistants.
- Meanwhile, Gemini 2.5 Pro offers strong, balanced performance with the largest context window and the most cost-effective pricing.
Claude 4 vs GPT-4o vs Gemini 2.5 Pro: Coding Capabilities
Now we’ll compare the code-writing capabilities of Claude 4, GPT-4o, and Gemini 2.5 Pro. To do that, we’ll give the same prompt to all three models and evaluate their responses on the following metrics:
- Efficiency
- Readability
- Comments and Documentation
- Error Handling
Task 1: Design Playing Cards with HTML, CSS, and JS
Prompt: “Create an interactive webpage that displays a collection of WWE Superstar flashcards using HTML, CSS, and JavaScript. Each card should represent a WWE wrestler and must include a front and back side. On the front, display the wrestler’s name and image. On the back, show additional stats such as their finishing move, brand, and championship titles. The flashcards should have a flip animation when hovered over or clicked.
Additionally, add interactive controls to make the page dynamic: a button that shuffles the cards, and another that shows a random card from the deck. The layout should be visually appealing and responsive for different screen sizes. Bonus points if you include sound effects like entrance music when a card is flipped.
Key Features to Implement:
- Front of card: wrestler’s name + image
- Back of card: stats (e.g., finisher, brand, titles)
- Flip animation using CSS or JS
- “Shuffle” button to randomly reorder cards
- “Show Random Superstar” button
- Responsive design.”
Claude 4’s Response:
GPT-4o’s Response:
Gemini 2.5 Pro’s Response:
Comparative Analysis
In the first task, Claude 4 delivered the most interactive experience with the most dynamic visuals, and it also added a sound effect when a card is clicked. GPT-4o produced a dark-themed layout with smooth transitions and fully functional buttons, but lacked the audio functionality. Meanwhile, Gemini 2.5 Pro produced the simplest and most basic sequential layout, with no animation or sound; its random-card feature also failed to show the card’s face properly. Overall, Claude takes the lead here, followed by GPT-4o, and then Gemini.
Task 2: Build a Game
Prompt: “Spell Strategy Game is a turn-based battle game built with Pygame, where two mages compete by casting spells from their spellbooks. Each player starts with 100 HP and 100 Mana and takes turns selecting spells that deal damage, heal, or apply special effects like shields and stuns. Spells consume mana and have cooldown periods, requiring players to manage resources and strategize carefully. The game features an engaging UI with health and mana bars, and spell cooldown indicators. Players can face off against another human or an AI opponent, aiming to reduce their rival’s HP to zero through tactical decisions.
Key Features:
- Turn-based gameplay with two mages (PvP or PvAI)
- 100 HP and 100 Mana per player
- Spellbook with varied spells: damage, healing, shields, stuns, mana recharge
- Mana costs and cooldowns for each spell to encourage strategic play
- Visual UI elements: health/mana bars, cooldown indicators, spell icons
- AI opponent with simple tactical decision-making
- Mouse-driven controls with optional keyboard shortcuts
- Clear in-game messaging showing actions and effects”
Claude 4’s Response:
GPT-4o’s Response:
Gemini 2.5 Pro’s Response:
Comparative Analysis
In the second task, on the whole, none of the models produced proper graphics: each displayed a black screen with a minimal interface. However, Claude 4 offered the most functional and smooth control over the game, with a range of attack, defense, and other strategic gameplay options. GPT-4o, on the other hand, suffered from performance issues such as lag and a small window size. Gemini 2.5 Pro fell short here as well, as its code failed to run and produced errors. Overall, once again, Claude takes the lead, followed by GPT-4o, and then Gemini 2.5 Pro.
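To make the core mechanics concrete, here is a minimal, framework-agnostic sketch of the turn loop the prompt describes: spells with mana costs and cooldowns, plus a trivial AI that picks randomly from whatever is currently castable. This is our own illustration under assumed names (`Spell`, `Mage`, `SPELLBOOK`), not code from any of the models’ responses; a full Pygame version would layer rendering and input handling on top of this state.

```python
from dataclasses import dataclass, field
import random

# Illustrative sketch only; class and spell names are our own assumptions.
@dataclass
class Spell:
    name: str
    mana_cost: int
    damage: int = 0    # damage dealt to the opponent
    heal: int = 0      # HP restored to the caster
    cooldown: int = 0  # turns to wait before recasting

@dataclass
class Mage:
    name: str
    hp: int = 100
    mana: int = 100
    cooldowns: dict = field(default_factory=dict)  # spell name -> turns left

    def castable(self, spells):
        return [s for s in spells
                if s.mana_cost <= self.mana and self.cooldowns.get(s.name, 0) == 0]

    def cast(self, spell, target):
        self.mana -= spell.mana_cost
        self.cooldowns[spell.name] = spell.cooldown
        target.hp -= spell.damage
        self.hp = min(100, self.hp + spell.heal)

    def end_turn(self):
        # Tick cooldowns down and regenerate a little mana each turn.
        self.cooldowns = {k: v - 1 for k, v in self.cooldowns.items() if v > 1}
        self.mana = min(100, self.mana + 10)

SPELLBOOK = [
    Spell("Fireball", mana_cost=30, damage=25, cooldown=2),
    Spell("Heal", mana_cost=20, heal=20, cooldown=3),
    Spell("Zap", mana_cost=10, damage=10),
]

def duel():
    attacker, defender = Mage("Mage A"), Mage("Mage B")
    while attacker.hp > 0 and defender.hp > 0:
        options = attacker.castable(SPELLBOOK)
        if options:  # trivial AI: any spell that is affordable and off cooldown
            spell = random.choice(options)
            attacker.cast(spell, defender)
            print(f"{attacker.name} casts {spell.name} -> {defender.name} at {defender.hp} HP")
        attacker.end_turn()
        attacker, defender = defender, attacker
    winner = attacker if attacker.hp > 0 else defender
    print(f"{winner.name} wins!")

if __name__ == "__main__":
    duel()
```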
Task 3: Best Time to Buy and Sell Stock
Prompt: “You are given an array prices where prices[i] is the price of a given stock on the ith day.
Find the maximum profit you can achieve. You may complete at most two transactions.
Note: You may not engage in multiple transactions simultaneously (i.e., you must sell the stock before you buy again).
Example:
Input: prices = [3,3,5,0,0,3,1,4]
Output: 6
Explanation: Buy on day 4 (price = 0) and sell on day 6 (price = 3), profit = 3-0 = 3. Then buy on day 7 (price = 1) and sell on day 8 (price = 4), profit = 4-1 = 3.”
Claude 4’s Response:
GPT-4o’s Response:

Gemini 2.5 Pro’s Response:

Comparative Analysis
In the third and final task, the models had to solve the problem using dynamic programming. Of the three, GPT-4o provided the most practical and well-structured solution, using a clean 2D dynamic programming approach with safe initialization, and it also included test cases. Claude 4 presented a more detailed and educational approach, but it is more verbose. Meanwhile, Gemini 2.5 Pro gave a concise solution, but used INT_MIN initialization, which is a risky approach. So in this task, GPT-4o takes the lead, followed by Claude 4, and then Gemini 2.5 Pro.
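For reference, below is a compact sketch of the kind of solution being compared: a 2D DP table indexed by (transactions used, day), initialized from the prices themselves rather than an INT_MIN-style sentinel, with the prompt’s example as a test case. This is our own illustration, not a reproduction of any model’s answer.

```python
def max_profit(prices: list[int]) -> int:
    """Max profit with at most two transactions.

    dp[k][i] = max profit using at most k transactions through day i.
    'best' tracks max(dp[k-1][j] - prices[j]) for j <= i, so each row
    fills in O(n) without risky integer-minimum sentinels.
    """
    if not prices:
        return 0
    n, K = len(prices), 2
    dp = [[0] * n for _ in range(K + 1)]
    for k in range(1, K + 1):
        best = -prices[0]  # profit so far minus the current buy price
        for i in range(1, n):
            dp[k][i] = max(dp[k][i - 1], prices[i] + best)
            best = max(best, dp[k - 1][i] - prices[i])
    return dp[K][n - 1]

# Test cases, including the example from the prompt.
assert max_profit([3, 3, 5, 0, 0, 3, 1, 4]) == 6   # two separate trades
assert max_profit([1, 2, 3, 4, 5]) == 4            # one trade is optimal
assert max_profit([7, 6, 4, 3, 1]) == 0            # never profitable
assert max_profit([]) == 0
print("All tests passed.")
```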
Final Verdict: Overall Analysis
Here’s a comparative summary of how well each model performed in the above tasks.
| Task | Claude 4 | GPT-4o | Gemini 2.5 Pro | Winner |
|---|---|---|---|---|
| Task 1 (Card UI) | Most interactive, with animations and sound effects | Simple dark theme with functional buttons, no audio | Basic sequential layout, card-face issue, no animation/sound | Claude 4 |
| Task 2 (Game Control) | Smooth controls, broad strategy options, most functional game | Usable but laggy, small window | Failed to run, interface errors | Claude 4 |
| Task 3 (Dynamic Programming) | Verbose but educational, good for learning | Clean and safe DP solution with test cases, most practical | Concise but unsafe (uses INT_MIN), lacks robustness | GPT-4o |
Conclusion
Through this comparison across three diverse tasks, we have observed that Claude 4 stands out with its interactive UI design capabilities and stable logic in modular programming, making it the top performer overall. GPT-4o follows closely with its clean, practical coding and excels at algorithmic problem-solving. Meanwhile, Gemini 2.5 Pro lagged in UI design and execution stability across the tasks. That said, these observations are based entirely on the comparison above; each model has its own strengths, and the right choice ultimately depends on the problem you are trying to solve.