I pitted Claude 3.5 Sonnet against AI coding tests ChatGPT aced – and it failed creatively

Last week, I got an email from Anthropic announcing that Claude 3.5 Sonnet was available. According to the AI company, “Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations.”

The company added: “Claude 3.5 Sonnet is ideal for complex tasks like code generation.” I decided to see if that was true.

I’ll subject the new Claude 3.5 Sonnet model to my standard set of coding tests, which I’ve run against a wide range of AIs with a wide range of results. Want to follow along with your own tests? Point your browser to How I test an AI chatbot’s coding ability – and you can too, which contains all the standard tests I apply, explanations of how they work, and what to look for in the results.

OK, let’s dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

At first, this seemed to have so much promise. Let’s start with the user interface Claude 3.5 Sonnet created based on my test prompt.

This is the first time an AI has decided to put the two data fields side-by-side. The layout is clean and looks great.

Claude also decided to do something else I’ve never seen an AI do. This plugin can be created using just PHP code, which is the code running on the back end of a WordPress server.

But some AI implementations have also added JavaScript code (which runs in the browser to control dynamic user interface features) and CSS code (which controls how the browser displays information).

In a PHP environment, if you need PHP, JavaScript, and CSS, you can either include the CSS and JavaScript right in the PHP code (that’s a feature of PHP), or you can put the code in three separate files: one for PHP, one for JavaScript, and one for CSS.

Usually, when an AI wants to use all three languages, it shows what needs to be cut and pasted into the PHP file, then another block to be cut and pasted into a JavaScript file, and then a third block to be cut and pasted into a CSS file.

But Claude just provided one PHP file and then, when it ran, auto-generated the JavaScript and CSS files into the plugin’s home directory. This is both fairly impressive and somewhat wrong-headed. It’s cool that it tried to make the plugin creation process easier, but whether or not a plugin can write to its own folder depends on the settings of the OS configuration, and there’s a very high chance it could fail.

I allowed it in my testing environment, but I’d never allow a plugin to rewrite its own code in a production environment. That’s a very serious security flaw.
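To make the failure mode concrete, here is a minimal sketch of the kind of probe a program could run before trying to write generated files into its own folder. It is written in Python rather than the plugin's PHP, purely for illustration; the function name `can_write_assets` is hypothetical and is not anything Claude generated.

```python
import os
import tempfile


def can_write_assets(plugin_dir: str) -> bool:
    """Return True if a file can actually be created in plugin_dir.

    Hypothetical helper: it probes by creating (and auto-deleting) a
    temporary file instead of trusting permission bits alone.
    """
    try:
        # The temporary file is removed as soon as the block exits.
        with tempfile.NamedTemporaryFile(dir=plugin_dir):
            pass
        return True
    except OSError:
        # Covers a missing directory, a read-only folder, and similar errors.
        return False


if __name__ == "__main__":
    here = os.path.dirname(os.path.abspath(__file__))
    print("writable" if can_write_assets(here) else "not writable")
```

If a probe like this fails, the safer path is to embed the CSS and JavaScript inline rather than assume the write will succeed.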

Despite the fairly creative nature of Claude’s code generation solution, the bottom line is that the plugin failed. Pressing the Randomize button does absolutely nothing. That’s sad because, as I said, it had so much promise.

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Interface: good, functionality: fail
ChatGPT GPT-4o: Interface: good, functionality: good
Microsoft Copilot: Interface: adequate, functionality: fail
Meta AI: Interface: adequate, functionality: fail
Meta Code Llama: Complete failure
Google Gemini Advanced: Interface: good, functionality: fail
ChatGPT 4: Interface: good, functionality: good
ChatGPT 3.5: Interface: good, functionality: good

2. Rewriting a string function

This test is designed to evaluate how well the AI rewrites code to work more appropriately for the given need; in this case, dollars and cents conversions.

The Claude 3.5 Sonnet revision properly removed leading zeros, making sure that entries like “000123” are treated as “123”. It properly allows integers and decimals with up to two decimal places (which is the key fix the prompt asked for). It prevents negative values. And it’s smart enough to return “0” for any weird or unexpected input, which prevents the code from abnormally ending in an error.

One failure is that it won’t allow decimal values alone to be entered. So if the user entered 50 cents as “.50” instead of “0.50”, it would fail the entry. Based on how the original text description for the test is written, it should have allowed this input form.

Although most of the revised code worked, I have to count this as a fail because if the code were pasted into a production project, users would not be able to enter inputs that contained only values for cents.
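Here is a minimal sketch of the behavior described above, including the cents-only case. It is in Python for illustration only (it is not the code from the test), and the name `normalize_amount` and the exact rules, such as rejecting a trailing decimal point, are assumptions.

```python
import re

# Optional whole-dollar digits, then an optional point with one or two digits.
# Making the leading digits optional is what lets ".50" through.
_AMOUNT = re.compile(r"(\d*)(?:\.(\d{1,2}))?", re.ASCII)


def normalize_amount(text: str) -> str:
    """Return a cleaned dollars-and-cents string, or "0" for invalid input."""
    match = _AMOUNT.fullmatch(text.strip())
    if match is None:
        return "0"  # negatives, letters, three or more decimals, etc.
    whole, cents = match.group(1), match.group(2)
    if not whole and cents is None:
        return "0"  # empty input
    whole = whole.lstrip("0") or "0"  # drop leading zeros, keep a lone zero
    return whole if cents is None else f"{whole}.{cents}"


if __name__ == "__main__":
    for sample in ["000123", ".50", "0.50", "-5", "12.345", "abc"]:
        print(repr(sample), "->", normalize_amount(sample))
```

The one design choice that matters for this test is the `\d*` before the decimal point: requiring at least one leading digit there would reject cents-only entries.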

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Failed
ChatGPT GPT-4o: Succeeded
Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Succeeded
Google Gemini Advanced: Failed
ChatGPT 4: Succeeded
ChatGPT 3.5: Succeeded

3. Finding an annoying bug

The big challenge of this test is that the AI is tasked with finding a bug that’s not obvious and that, to solve correctly, requires knowledge of the WordPress platform. It’s also a bug I did not immediately see on my own and originally asked ChatGPT to solve (which it did).

Claude not only got this right, catching the subtlety of the error and correcting it, but it was also the first AI since I published the full set of tests online to catch the fact that the publishing process introduced an error into the sample query (which I subsequently fixed and republished).

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Succeeded
ChatGPT GPT-4o: Succeeded
Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
Meta AI: Succeeded
Meta Code Llama: Failed
Google Gemini Advanced: Failed
ChatGPT 4: Succeeded
ChatGPT 3.5: Succeeded

So far, we’re at two out of three fails. Let’s move on to our last test.

4. Writing a script

This test is designed to see how far the AI’s programming knowledge goes into specialized programming tools. While AppleScript is fairly common for scripting on Macs, Keyboard Maestro is a commercial application sold by a lone programmer in Australia. I find it indispensable, but it’s just one of many such apps on the Mac.

However, when testing in ChatGPT, ChatGPT knew how to “speak” Keyboard Maestro as well as AppleScript, which shows how broad its programming language knowledge is.

Unfortunately, Claude does not have that knowledge. It did write an AppleScript that attempted to speak to Chrome (that’s part of the test parameter), but it ignored the essential Keyboard Maestro component.

Worse, it generated code in AppleScript that would generate a runtime error. In an attempt to ignore case for the match in the test, Claude generated the line:

if theTab's title contains input ignoring case then

This is pretty much a double error because the “contains” statement is case insensitive and the phrase “ignoring case” does not belong where it was placed. It caused the script to error out with an “Ignoring can’t go after this” syntax error message.

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Failed
ChatGPT GPT-4o: Succeeded but with reservations
Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Failed
Google Gemini Advanced: Succeeded
ChatGPT 4: Succeeded
ChatGPT 3.5: Failed

Overall results

Here are the overall results of the four tests:
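The per-test lists above can be tallied with a few lines of Python. This is only a sketch under stated assumptions: the WordPress plugin test counts as a pass only when functionality was rated good, and GPT-4o's "succeeded but with reservations" on the script test counts as a pass.

```python
# Pass/fail per AI for tests 1-4, transcribed from the per-test lists above.
# Assumptions: test 1 passes only if functionality was rated "good";
# "succeeded but with reservations" is treated as a pass.
RESULTS = {
    "Claude 3.5 Sonnet":      [False, False, True,  False],
    "ChatGPT GPT-4o":         [True,  True,  True,  True],
    "Microsoft Copilot":      [False, False, False, False],
    "Meta AI":                [False, False, True,  False],
    "Meta Code Llama":        [False, True,  False, False],
    "Google Gemini Advanced": [False, False, False, True],
    "ChatGPT 4":              [True,  True,  True,  True],
    "ChatGPT 3.5":            [True,  True,  True,  False],
}


def tally(results: dict[str, list[bool]]) -> dict[str, int]:
    """Count how many tests each AI passed."""
    return {name: sum(passed) for name, passed in results.items()}


if __name__ == "__main__":
    scores = tally(RESULTS)
    # Highest score first; ties keep the order used in the lists above.
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score} of 4")
```

Under those assumptions, ChatGPT GPT-4o and ChatGPT 4 pass all four tests, ChatGPT 3.5 passes three, Microsoft Copilot passes none, and the others, Claude 3.5 Sonnet included, pass one each.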

I was somewhat bummed about Claude 3.5 Sonnet. The company specifically promised that this version was suited to programming. But as you can see, not so much. It’s not that it can’t program. It just can’t program correctly.

I keep looking for an AI that can best the ChatGPT solutions, especially as platform and programming environment vendors start to integrate these other models directly into the programming process. But, for now, I’m going back to ChatGPT when I need programming help, and that’s my advice to you as well.

Have you used an AI to help you program? Which one? How did it go? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.