Yikes! Microsoft Copilot failed every single one of my coding tests

bicycledays

April 29, 2024

Recently, my ZDNET colleague and fellow AI explorer Sabrina Ortiz wrote an article entitled 7 reasons I use Copilot instead of ChatGPT. I had never been terribly impressed with Copilot, especially since it failed some fact-checking tests I ran against it last year. But Sabrina made some really good points about the benefits of Microsoft’s offering, so I thought I would give it another try.

To be clear, because Microsoft names everything Copilot, the Copilot I am testing is the general-purpose chatbot. There is a GitHub version of Copilot, but that runs as an extension inside Visual Studio Code and is available for a monthly or yearly fee. I did not test GitHub Copilot.

Instead, I loaded my standard set of four tests and fed them into the chatbot version of Copilot.

To recap, here is a description of the tests I am using:

Writing a WordPress plugin: This tests basic web development, using the PHP programming language, inside WordPress. It also requires a bit of user interface building. If an AI chatbot passes this test, it can help create rudimentary code as an assistant to web developers. I originally documented this test in “I asked ChatGPT to write a WordPress plugin I needed. It did it in less than 5 minutes.”
Rewriting a string function: This test evaluates how an AI chatbot updates a utility function for better functionality. If an AI chatbot passes this test, it might be able to help create tools for programmers. If it fails, first-year programming students can probably do a better job. I originally documented this test in “OK, so ChatGPT just debugged my code. For real.”
Finding an annoying bug: This test requires intimate knowledge of how WordPress works because the obvious answer is wrong. If an AI chatbot can answer this correctly, then its knowledge base is pretty complete, even with frameworks like WordPress. I originally documented this test in “OK, so ChatGPT just debugged my code. For real.”
Writing a script: This test asks an AI chatbot to program using two fairly specialized programming tools not known to many users. It essentially tests the AI chatbot’s knowledge beyond the big languages. I originally documented this test in “Google unveils Gemini Code Assist and I’m cautiously optimistic it will help programmers.”

Let’s dig into the results of each test and see how they compare to previous tests using Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

Here’s Copilot’s result on the left and the ChatGPT result on the right.

Unlike ChatGPT, which styled the fields to look uniform, Copilot left that as an exercise for the user, stating “Remember to adjust the styling and error handling as needed.”

To test, I inserted a set of names. When I clicked Randomize Lines, I got nothing back in the result field.

A look at the code showed some interesting programming errors, indicating that Copilot didn’t really know how to write code for WordPress. For example, it assigned the hook intended to process the form to the admin_init action. That’s not something that would cause the form to process; it’s what initializes the admin interface.

It also didn’t have code to actually display the randomized lines. It does store them in a value, but it doesn’t retrieve and display them. The duplicate check was partially correct in that it did sort names together, but it didn’t compare names to each other, so duplicates were still allowed.
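To make the two missing pieces concrete, here is a minimal sketch in Python (not the PHP a real WordPress plugin would use). It assumes that "no duplicates" means each name should appear only once in the output and that names differing only in letter case count as the same name; the article does not spell out either rule. The function name randomize_lines and the seed parameter are hypothetical, chosen for illustration.

```python
import random


def randomize_lines(text, seed=None):
    """Return the unique, non-empty lines of `text` in random order.

    Assumption: a duplicate is a name that matches an earlier one after
    trimming whitespace and ignoring case. The first spelling seen is kept.
    """
    seen = set()
    unique = []
    for line in text.splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        key = name.casefold()
        if key in seen:
            continue  # actually compare names, so duplicates are dropped
        seen.add(key)
        unique.append(name)

    # A seeded generator makes the shuffle repeatable for testing.
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Returning the list is the step that hands the result back for display.
    return unique


if __name__ == "__main__":
    for name in randomize_lines("Ann\nBob\nann\n\nCid\nBob"):
        print(name)
```

The point of the sketch is only that the comparison and the hand-off of the result both have to exist; how a plugin would render the list in the result field is outside its scope.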

Copilot is apparently using a more advanced LLM (GPT-4) than the one I ran these tests on with the free version of ChatGPT (GPT-3.5), and yet the results of ChatGPT still seem to be better. I find that a bit baffling.

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Interface: adequate, functionality: fail
Meta AI: Interface: adequate, functionality: fail
Meta Code Llama: Complete failure
Google Gemini Advanced: Interface: good, functionality: fail
ChatGPT: Interface: good, functionality: good

2. Rewriting a string function

This test is designed to test dollars and cents conversions. While the Copilot-generated code does properly flag an error if a value containing a letter or more than one decimal point is sent to it, it doesn’t perform a complete validation.

For example, it allows for leading zeroes. It also allows for more than two digits to the right of the decimal point.

While it does properly generate errors for the more egregious entry errors, the values it allows as correct could cause subsequent routines to fail if they’re expecting a strict dollars and cents value.
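As an illustration of what a stricter check might look like, here is a minimal Python sketch. The exact rules of the test are not given in this article, so the format below is an assumption: whole dollars with no leading zeroes (a lone "0" is allowed), optionally followed by a decimal point and exactly two digits. The names STRICT_AMOUNT and is_strict_dollars_and_cents are hypothetical.

```python
import re

# Assumed format: "0" or a number starting with 1-9, then an optional
# decimal point followed by exactly two digits. [0-9] is used instead of
# \d so that non-ASCII digits are rejected.
STRICT_AMOUNT = re.compile(r"(0|[1-9][0-9]*)(\.[0-9]{2})?")


def is_strict_dollars_and_cents(value: str) -> bool:
    """Return True only if the whole string matches the assumed format."""
    return STRICT_AMOUNT.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["12.34", "0.99", "5", "007.50", "1.234", "1.2.3", "12a"]:
        print(sample, is_strict_dollars_and_cents(sample))
```

Under these assumed rules, "007.50" and "1.234" are rejected along with the more obvious errors such as "1.2.3" and "12a". Whether a bare "5" or a one-digit fraction like "5.5" should pass is a design decision that depends on what the downstream routines expect.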

If a student turned this in as an assignment, I’d give it a C. But if programmers in the real world are relying on Copilot to generate code that won’t cause failures down the line, what Copilot generated is just not good enough. I have to give it a fail.

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Succeeded
Google Gemini Advanced: Failed
ChatGPT: Succeeded

3. Finding an annoying bug

Well, this is new. Okay, first, let me back up and put this test into context. This tests the AI’s ability to think a few chess moves ahead. The answer that seems obvious isn’t the right answer. I got caught by that when I was originally debugging the issue that eventually became this test.

ChatGPT, much to my very great surprise at the time, saw through the “trick” of the problem and correctly identified what the code was doing wrong. To do so, it had to see not just what the code itself said, but how it behaved based on the way the WordPress API worked. Like I said, I was pretty shocked that ChatGPT could be that sophisticated.

Copilot, well, not so much. Copilot suggests I check the spelling of my function name and the WordPress hook name. The WordPress hook is a published thing, so it should be able to confirm, as I did, that it was spelled correctly. And my function is my function, so I can spell it however I want. If I had misspelled it somewhere in the code, the IDE would have very visibly pointed it out.

It also quite happily repeated the problem statement to me, suggesting I solve it. That’s what I asked it to do, and it turned it back to me, telling me the problem I told it, and then telling me it would work if I debugged it. Then, it ended with “consider seeking assistance from the plugin developer or community forums. 😊” And yeah, that emoji was part of the AI’s response.

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
Meta AI: Succeeded
Meta Code Llama: Failed
Google Gemini Advanced: Failed
ChatGPT: Succeeded

4. Writing a script

I would not originally have tried this test on an AI, but I had tried it on a lark with ChatGPT and it figured it out. So did Gemini Advanced.

The idea with this test is that it asks about a fairly obscure Mac scripting tool called Keyboard Maestro, as well as Apple’s scripting language AppleScript, and Chrome scripting behavior. For the record, Keyboard Maestro is one of the single biggest reasons I use Macs over Windows for my daily productivity, because it allows the entire OS and the various applications to be reprogrammed to suit my needs. It’s that powerful.

In any case, to pass the test, the AI has to properly describe how to solve the problem using a mix of Keyboard Maestro code, AppleScript code, and Chrome API functionality. Continuing its trend, Copilot didn’t do it right. It completely ignored Keyboard Maestro (I’m guessing it’s not in its dataset).

In the generated AppleScript, where I asked it to just scan the current window, Copilot repeated the process for all windows, returning results for the wrong window (the last one in the chain).
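The shape of that mistake can be sketched in a few lines of Python. This is not AppleScript and does not call Chrome or Keyboard Maestro; it models the browser as a plain list of windows, each a list of tab titles, and assumes the first entry is the current (front) window. Both function names are hypothetical.

```python
def tabs_in_current_window(windows):
    """Scan only the current window (assumed to be windows[0])."""
    return list(windows[0]) if windows else []


def tabs_with_all_windows_bug(windows):
    """Mimic the described mistake: loop over every window and keep
    overwriting the result, so only the last window's tabs survive."""
    result = []
    for tabs in windows:
        result = list(tabs)  # overwritten on each pass through the loop
    return result


if __name__ == "__main__":
    windows = [["Inbox", "Calendar"], ["Docs"]]
    print(tabs_in_current_window(windows))
    print(tabs_with_all_windows_bug(windows))
```

With two windows open, the first function reports the front window’s tabs while the second reports only the last window’s, which is the kind of wrong-window result described above.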

Here are the aggregate results of this and previous tests:

Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Failed
Google Gemini Advanced: Succeeded
ChatGPT: Succeeded

Overall results

Here are the overall results of the four tests:

The results here really surprised me. It’s been about five months since I last tested Copilot against other AIs. I fully expected Microsoft to have worked out the bugs. I expected that Copilot would do as well as, or perhaps even better than, ChatGPT. After all, Microsoft is a big investor in OpenAI (makers of ChatGPT) and Copilot is based on the same language model as ChatGPT.

And yet, it failed spectacularly, turning in the worst score of any of the AIs I’ve tested by not passing a single coding test. Not one. The last time I tested Copilot, I tried doing some fact-checking using all the AIs. All the other AIs answered the question and gave back fairly usable results. Copilot returned the data I asked it to verify, which was similar to the behavior I found in Test 3 above.

I’m not impressed. In fact, I find the results from Microsoft’s flagship AI offering to be a little demoralizing. It should be so much better. Ah well, Microsoft does improve its products over time. Maybe by next year.

Have you tried coding with Copilot, Meta AI, Gemini, or ChatGPT? What has your experience been? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.