There's been a ton of buzz about how AI can help with programming, but in the first year or two of generative AI, a lot of that was hype. Microsoft ran big events celebrating how Copilot could help you code, but when I put it to the test in April 2024, it failed all four of my standardized tests. It completely struck out. Crashed and burned. Fell off the cliff. It performed the worst of any AI I tested.
Mixed metaphors aside, let's stick with baseball. Copilot traded its cleats for a bus pass. It wasn't worthy.
But time spent in the bullpen of life seems to have helped Copilot. This time, when it showed up for tryouts, it was warmed up and ready to step into the box. It was throwing heat in the bullpen. When it was time to play, it had its eye on the ball and its swing dialed in. Clearly, it was game-ready and looking for a pitch to drive.
But could it stand up to my tests? With a squint in my eye, I stepped onto the pitcher's mound and started off with an easy lob. Back in 2024, you could feel the wind as Copilot swung and missed. But now, in April 2025, Copilot connected squarely with the ball and hit it straight and true.
We had to send Copilot down, but it fought its way back to the show. Here's the play-by-play.
1. Writing a WordPress plugin
Well, Copilot has certainly improved since its first run of this test in April 2024. The first time, it didn't provide code to actually display the randomized lines. It did store them in a value, but it never retrieved and displayed them. In other words, it swung and missed. It didn't produce any output.
This is the result of the latest run:
This time, the code worked. It did leave a random extra blank line at the end, but since it fulfilled the programming assignment, we'll call it good.
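If you're curious what the assignment looks like, here's a minimal sketch of the kind of plugin the test calls for: a shortcode that shuffles whatever lines it's given and actually displays them. This is my own illustration with made-up shortcode and function names, not the test prompt and not Copilot's generated code.

```php
<?php
/**
 * Plugin Name: Random Line Shuffler (illustrative sketch, not Copilot's output)
 */

// Register a [shuffle_lines]...[/shuffle_lines] shortcode. The user wraps
// a list of lines in the shortcode, and the plugin returns them shuffled.
add_shortcode( 'shuffle_lines', 'rls_shuffle_lines' );

function rls_shuffle_lines( $atts, $content = '' ) {
	// Split the enclosed content into trimmed, non-empty lines.
	$lines = array_filter( array_map( 'trim', explode( "\n", $content ) ) );

	// Randomize the order of the lines.
	shuffle( $lines );

	// Return the shuffled lines so WordPress actually displays them --
	// the step the 2024 version of Copilot left out.
	return '<p>' . implode( '<br>', array_map( 'esc_html', $lines ) ) . '</p>';
}
```

The real test prompt is more detailed than this, but the shape is the same: take lines in, randomize them, and, crucially, output the result.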
Copilot's unbroken streak of completely unmitigated programming failures has been broken. Let's see how it does on the rest of the tests.
2. Rewriting a string function
This test is designed to check dollars and cents conversions. In my first test back in April 2024, the Copilot-generated code did properly flag an error if a value containing a letter or more than one decimal point was sent to it, but it didn't perform a complete validation. It allowed results through that could have caused subsequent routines to fail.
This run, however, did quite well. It performs most of the tests properly. It returns false for numbers with more than two digits to the right of the decimal point, like 1.234 and 1.230. It also returns false for numbers with extra leading zeros. So 0.01 is allowed, but 00.01 is not.
Technically, those values could be converted to usable currency values, but it's never bad for a validation routine to be strict in its tests. The main goal is that the validation routine doesn't let a value through that could cause a subsequent routine to crash. Copilot did well here.
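To make that behavior concrete, here's a sketch of a strict validator that matches the cases described above. The function name and the regular expression are my own stand-ins, not the code Copilot produced.

```php
<?php
// Hypothetical strict validator for dollars-and-cents strings:
// at most two digits after the decimal point, and no extra leading
// zeros on the integer part.
function is_valid_dollar_amount( string $value ): bool {
	return preg_match( '/^(0|[1-9][0-9]*)(\.[0-9]{1,2})?$/', $value ) === 1;
}

// These examples mirror the cases discussed above.
var_dump( is_valid_dollar_amount( '0.01' ) );  // true
var_dump( is_valid_dollar_amount( '00.01' ) ); // false (extra leading zero)
var_dump( is_valid_dollar_amount( '1.234' ) ); // false (three decimal digits)
var_dump( is_valid_dollar_amount( '1.230' ) ); // false (three decimal digits)
```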
We're now at two for two, a big improvement over the results from its first run.
3. Finding an annoying bug
I gotta tell you how Copilot first answered this back in April 2024, because it's just too good.
This tests the AI's ability to think several chess moves ahead. The answer that seems obvious isn't the right answer. I got caught by that when I was originally debugging the issue that eventually became this test.
On Copilot's first run, it suggested I check the spelling of my function name and the WordPress hook name. The WordPress hook is a published thing, so Copilot should have been able to confirm the spelling. And my function is my function, so I can spell it however I want. If I had misspelled it somewhere in the code, the IDE would have very visibly pointed it out.
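For context, this is roughly the shape of a WordPress hook registration. The names below are generic placeholders, not the actual plugin from the bug: the hook name is part of WordPress's published API, while the callback is the developer's own function.

```php
<?php
// Generic illustration only -- not the plugin from the test.
// 'wp_footer' is a documented WordPress hook name; misspell it and
// WordPress simply never fires anything registered to it. The callback
// below is the developer's own function, spelled however they like.
add_action( 'wp_footer', 'my_plugin_footer_notice' );

function my_plugin_footer_notice() {
	echo '<p>' . esc_html__( 'Hello from my plugin.', 'my-plugin' ) . '</p>';
}
```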
And it gets better. Back then, Copilot also quite happily repeated the problem statement back to me, suggesting I solve the problem myself. Yeah, its overall recommendation was that I debug it. Well, duh. Then, it ended with "consider seeking help from the plugin developer or community forums. 😊" (and yes, that emoji was part of the AI's response).
It was a spectacular, enthusiastic, emojic failure. See what I mean? Early AI answers, no matter how useless, deserve to be immortalized.
Especially since Copilot wasn't nearly as much fun this time. It just solved it. Quickly, cleanly, clearly. Done and done. Solved.
That puts Copilot at three-for-three and decisively moves it out of the "don't use this tool" category. Bases are loaded. Let's see if Copilot can score a home run.
4. Writing a script
The idea with this test is that it asks about a fairly obscure Mac scripting tool called Keyboard Maestro, as well as Apple's scripting language AppleScript, and Chrome scripting behavior. For the record, Keyboard Maestro is one of the single biggest reasons I use Macs over Windows for my daily productivity, because it allows the entire OS and the various applications to be reprogrammed to suit my needs. It's that powerful.
In any case, to pass the test, the AI has to properly describe how to solve the problem using a combination of Keyboard Maestro code, AppleScript code, and Chrome API functionality.
Back in the day, Copilot didn't do it right. It completely ignored Keyboard Maestro (at the time, it probably wasn't in its knowledge base). In the generated AppleScript, where I asked it to scan only the current window, Copilot repeated the process for all windows, returning results for the wrong window (the last one in the chain).
But not now. This time, Copilot did it right. It did exactly what was asked, got the right window and tab, properly talked to Keyboard Maestro and Chrome, and used actual AppleScript syntax for the AppleScript.
Bases loaded. Home run.
Overall results
Last year, I said I wasn't impressed. In fact, I found the results somewhat demoralizing. But I also said this:
Ah well, Microsoft does improve its products over time. Maybe by next year.
In the past year, Copilot went from strikeouts to shaking the scoreboard. It went from batting cleanup in the basement to chasing a pennant under the lights.
What about you? Have you taken Copilot or another AI coding assistant out to the field lately? Do you think it's finally ready for the big leagues, or is it still riding the bench? Have you had any strikeouts or home runs using AI for development? And what would it take for one of these tools to earn a spot in your starting lineup? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.