I tested DeepSeek’s R1 and V3 coding skills – and we’re not all doomed (yet)

DeepSeek exploded into the world’s consciousness this past weekend. It stands out for three powerful reasons:

It’s an AI chatbot from China, rather than the US.
It’s open source.
It uses vastly less infrastructure than the big AI tools we’ve been looking at.

Given the US government’s concerns over TikTok and possible Chinese government involvement in that code, a new AI emerging from China is bound to generate attention. ZDNET’s Radhika Rajkumar did a deep dive into those issues in her article “Why China’s DeepSeek could burst our AI bubble.”

In this article, we’re avoiding politics. Instead, I’m putting both DeepSeek V3 and DeepSeek R1 through the same set of AI coding tests I’ve thrown at 10 other large language models. According to DeepSeek itself:

Choose V3 for tasks requiring depth and accuracy (e.g., solving advanced math problems, generating complex code).
Choose R1 for latency-sensitive, high-volume applications (e.g., customer support automation, basic text processing).

You can choose between R1 and V3 by clicking the little button in the chat interface. If the button is blue, you’re using R1.

The short answer is this: impressive, but clearly not perfect. Let’s dig in.

Test 1: Writing a WordPress plugin

This test was actually my first test of ChatGPT’s programming prowess, way back in the day. My wife needed a plugin for WordPress that would help her run an involvement device for her online group.

Her needs were fairly simple. It needed to take in a list of names, one name per line. It then had to sort the names, and if there were duplicate names, separate them so they weren’t listed side-by-side.
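To make the requirement concrete, here is a minimal sketch of the sort-and-separate logic. It is written in Python for brevity rather than the PHP a real WordPress plugin would use, and it is not the code any of the AIs produced. The function name `sort_and_separate` and the tie-breaking rule (most frequent name first, then alphabetical) are assumptions made for illustration.

```python
import heapq
from collections import Counter
from typing import List


def sort_and_separate(text: str) -> List[str]:
    """Order names so that duplicates are not adjacent whenever that is possible.

    Hypothetical helper: takes one name per line and returns the reordered list.
    """
    names = [line.strip() for line in text.splitlines() if line.strip()]
    counts = Counter(names)

    # Min-heap of (-remaining_count, name): the most frequent name comes first,
    # and ties fall back to alphabetical order.
    heap = [(-count, name) for name, count in counts.items()]
    heapq.heapify(heap)

    result: List[str] = []
    while heap:
        neg, name = heapq.heappop(heap)
        if result and result[-1] == name:
            if not heap:
                # Only this name is left, so adjacency cannot be avoided.
                result.extend([name] * -neg)
                break
            # Take the next-best name instead and put this one back.
            other_neg, other_name = heapq.heappop(heap)
            heapq.heappush(heap, (neg, name))
            neg, name = other_neg, other_name
        result.append(name)
        if neg + 1 < 0:
            heapq.heappush(heap, (neg + 1, name))
    return result


if __name__ == "__main__":
    print(sort_and_separate("Bob\nAlice\nBob\nCarol"))
```

With no duplicates in the input, this reduces to a plain alphabetical sort. With duplicates, the repeated names are spread out as far as the list allows.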

I didn’t really have time to code it for her, so I decided to give the AI the challenge on a whim. To my huge surprise, it worked.

Since then, it’s been my first test for AIs when evaluating their programming skills. It requires the AI to know how to set up code for the WordPress framework and follow prompts clearly enough to create both the user interface and program logic.

Only about half of the AIs I’ve tested can fully pass this test. Now, however, we can add one more to the winner’s circle.

DeepSeek V3 created both the user interface and program logic exactly as specified. As for DeepSeek R1, well, that’s an interesting case. The “reasoning” aspect of R1 caused the AI to spit out 4502 words of analysis before sharing the code.

The UI looked different, with much wider input areas. However, both the UI and logic worked, so R1 also passes this test.

So far, DeepSeek V3 and R1 have both passed one of four tests.

Test 2: Rewriting a string function

A user complained that he was unable to enter dollars and cents into a donation entry field. As written, my code only allowed dollars. So, the test involves giving the AI the routine that I wrote and asking it to rewrite it to allow for both dollars and cents.

Usually, this results in the AI generating some regular expression validation code. DeepSeek did generate code that works, although there is room for improvement. The code that DeepSeek V3 wrote was unnecessarily long and repetitious, while the reasoning before generating the code in R1 was also very long.

My biggest concern is that both DeepSeek models’ validation code ensures validation up to 2 decimal places, but if a very long number is entered (like 0.30000000000000004), the use of parseFloat doesn’t apply any explicit rounding. The R1 model also used JavaScript’s Number conversion without checking for edge case inputs. If bad data comes back from an earlier part of the regular expression or a non-string makes it into that conversion, the code would crash.
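The defensive checks described above can be sketched as follows. This is a Python analogue, not the JavaScript that either model generated. The function name `parse_donation_cents`, the choice to return whole cents, and the exact pattern are all assumptions made for illustration. The idea is to reject non-strings before any conversion, accept at most two decimal places, and avoid binary floating point altogether.

```python
import re
from decimal import Decimal
from typing import Optional

# Whole dollars, optionally followed by a decimal point and one or two digits.
_AMOUNT_RE = re.compile(r"[0-9]+(?:\.[0-9]{1,2})?")


def parse_donation_cents(value: object) -> Optional[int]:
    """Return the amount in whole cents, or None if the input is not acceptable.

    Hypothetical helper: the name and return convention are illustrative only.
    """
    # Guard against non-string input before attempting any conversion.
    if not isinstance(value, str):
        return None
    text = value.strip()
    # fullmatch rejects anything with more than two decimal places,
    # such as "0.30000000000000004".
    if not _AMOUNT_RE.fullmatch(text):
        return None
    # Decimal keeps the value exact, so no rounding step is needed.
    return int(Decimal(text) * 100)


if __name__ == "__main__":
    for sample in ["12", "12.5", "12.50", "0.30000000000000004", "abc", 12]:
        print(repr(sample), "->", parse_donation_cents(sample))
```

Returning cents as an integer is one design choice among several; a form handler could just as well keep the validated string and convert it later.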

It’s odd, because R1 did present a very nice list of tests to validate against.

So here, we have a split decision. I’m giving the point to DeepSeek V3 because neither of the issues in its code would cause the program to break when run by a user, and the code would generate the expected results. On the other hand, I have to give a fail to R1 because if something that’s not a string somehow gets into the Number function, a crash will ensue.

And that gives DeepSeek V3 two wins out of four, but DeepSeek R1 only one win out of four so far.

Test 3: Finding an annoying bug

This is a test created when I had a very annoying bug that I had difficulty tracking down. Once again, I decided to see if ChatGPT could handle it, which it did.

The challenge is that the answer isn’t obvious. Actually, the challenge is that there is an obvious answer, based on the error message. But the obvious answer is the wrong answer. This not only caught me, but it regularly catches some of the AIs.

Solving this bug requires understanding how specific API calls within WordPress work, being able to see beyond the error message to the code itself, and then knowing where to find the bug.

Both DeepSeek V3 and R1 passed this one with nearly identical answers, bringing us to three out of four wins for V3 and two out of four wins for R1. That already puts DeepSeek ahead of Gemini, Copilot, Claude, and Meta.

Will DeepSeek score a home run for V3? Let’s find out.

Test 4: Writing a script

And another one bites the dust. This is a challenging test because it requires the AI to understand the interplay between three environments: AppleScript, the Chrome object model, and a Mac scripting tool called Keyboard Maestro.

I would have called this an unfair test because Keyboard Maestro is not a mainstream programming tool. But ChatGPT handled the test easily, understanding exactly what part of the problem is handled by each tool.

Unfortunately, neither DeepSeek V3 nor R1 had this level of knowledge. Neither model knew that it needed to split the task between instructions to Keyboard Maestro and Chrome. Both also had fairly weak knowledge of AppleScript, writing custom routines for functionality that is native to the language.

Weirdly, the R1 model failed as well because it made a bunch of incorrect assumptions. It assumed that a front window always exists, which is definitely not the case. It also assumed that the frontmost running program would always be Chrome, rather than explicitly checking to see if Chrome was running.

This leaves DeepSeek V3 with three correct tests and one fail, and DeepSeek R1 with two correct tests and two fails.

Final thoughts

I found that DeepSeek’s insistence on using a public cloud email address like gmail.com (rather than my normal email address with my corporate domain) was annoying. It also had a number of responsiveness fails that made doing these tests take longer than I would have liked.

I wasn’t sure I’d be able to write this article because, for most of the day, I got this error when trying to sign up:

DeepSeek’s online services have recently faced large-scale malicious attacks. To ensure continued service, registration is temporarily limited to +86 phone numbers. Existing users can log in as usual. Thanks for your understanding and support.

Then, I got in and was able to run the tests.

DeepSeek seems to be overly loquacious in terms of the code it generates. The AppleScript code in Test 4 was both wrong and excessively long. The regular expression code in Test 2 was correct in V3, but it could have been written in a way that made it much more maintainable. It failed in R1.

I’m definitely impressed that DeepSeek V3 beat out Gemini, Copilot, and Meta. But it appears to be at the old GPT-3.5 level, which means there’s definitely room for improvement. I was disappointed with the results for the R1 model. Given the choice, I’d still choose ChatGPT as my programming code helper.

That said, for a brand-new tool running on much lower infrastructure than the other tools, this could be an AI to watch.

What do you think? Have you tried DeepSeek? Are you using any AIs for programming support? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.