I put OpenAI’s o1-preview through my 4 AI coding tests. It surprised me (in a good way)

Usually, when a software company pushes out a major new release in May, they don’t try to top it with another major new release four months later. But there’s nothing usual about the pace of innovation in the AI business.

Although OpenAI dropped its new omni-powerful GPT-4o model in mid-May, the company has been busy. As far back as last November, Reuters published a rumor that OpenAI was working on a next-generation language model, then known as Q*. Reuters doubled down on that report in May, stating that Q* was being worked on under the code name Strawberry.

Strawberry, as it turns out, is actually a model called o1-preview, which is available now as an option for ChatGPT Plus subscribers. You can choose the model from the selection dropdown:

As you might imagine, if there’s a new ChatGPT model available, I’m going to put it through its paces. And that’s what I’m doing here.

The new Strawberry model focuses on reasoning, breaking down prompts and problems into steps. OpenAI showcases this approach through a reasoning summary that can be displayed before each answer.

When o1-preview is asked a question, it does some thinking and then displays how long that thinking took. If you toggle the dropdown, you’ll see some of the reasoning. Here’s an example from one of my coding tests:

It’s good that the AI knew enough to add error handling, but I find it interesting that o1-preview categorizes that step under “Regulatory compliance”.

I also discovered the o1-preview model provides more exposition after the code. In my first test, which created a WordPress plugin, the model provided explanations of the header, class structure, admin menu, admin page, logic, security measures, compatibility, installation instructions, operating instructions, and even test data. That’s a lot more information than was provided by previous models.

But really, the proof is in the pudding. Let’s put this new model through our standard tests and see how well it works.

1. Writing a WordPress plugin

This simple coding test requires knowledge of the PHP programming language and the WordPress framework. The challenge asks the AI to write both interface code and functional logic, with the twist being that instead of removing duplicate entries, it has to separate the duplicate entries, so they are not next to each other.

The o1-preview model excelled. It presented the UI first as just the entry field:

Once the data was entered, and Randomize Lines was clicked, the AI generated an output field with properly randomized output data. You can see how Abigail Williams is duplicated, and in compliance with the test instructions, the two entries are not listed side-by-side:

In my tests of other LLMs, only four of the 10 models passed this test. The o1-preview model completed this test perfectly.
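To make the twist in this test concrete, here is a minimal Python sketch of the core logic: shuffle the lines, but keep identical lines from landing next to each other. This is not the PHP plugin from my test, and it is not o1-preview’s output. The function names, the retry limit, and every sample name other than Abigail Williams are my own assumptions for illustration.

```python
import random
from collections import Counter


def has_adjacent_duplicates(lines):
    """Return True if any two neighboring lines are identical."""
    return any(a == b for a, b in zip(lines, lines[1:]))


def randomize_lines(lines, rng=None, max_attempts=1000):
    """Shuffle lines so that identical lines are never side by side.

    Simple retry approach: shuffle, check, and try again if duplicates
    ended up adjacent. max_attempts is an arbitrary assumed limit.
    """
    rng = rng or random.Random()
    lines = list(lines)
    if not lines:
        return lines

    # If one line fills more than half the slots (rounded up), no ordering
    # can keep its copies apart, so fail early instead of looping.
    most_common_count = Counter(lines).most_common(1)[0][1]
    if most_common_count > (len(lines) + 1) // 2:
        raise ValueError("too many copies of one line to keep them apart")

    for _ in range(max_attempts):
        rng.shuffle(lines)
        if not has_adjacent_duplicates(lines):
            return lines
    raise RuntimeError("no valid ordering found within max_attempts")


if __name__ == "__main__":
    # Hypothetical sample data; only Abigail Williams comes from the test.
    names = [
        "Abigail Williams",
        "John Proctor",
        "Abigail Williams",
        "Mary Warren",
        "Giles Corey",
    ]
    for name in randomize_lines(names):
        print(name)
```

The retry loop is the simplest thing that works for short lists with a few repeats. A list packed with duplicates would need a more deliberate placement strategy, such as always placing the most frequent remaining line that differs from the previous one.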

2. Rewriting a string function

Our second test fixes a string regular expression that was a bug reported by a user. The original code was designed to test if an entered number was valid for dollars and cents. Unfortunately, the code only allowed integers (so 5 was allowed, but not 5.25).

The o1-preview LLM rewrote the code successfully. The model joined four of my previous LLM tests in the winners’ circle.
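For a sense of what this kind of fix looks like, here is a small Python sketch. It is not the code from my test or the model’s rewrite. Both patterns and the validation rules (one or more digits, optionally followed by a decimal point and one or two digits) are assumptions I’m using purely to illustrate the integer-only bug and one plausible correction.

```python
import re

# The kind of pattern that causes the bug: whole numbers only.
INTEGER_ONLY = re.compile(r"[0-9]+")

# Assumed fix: digits, then optionally a decimal point and one or two digits.
DOLLARS_AND_CENTS = re.compile(r"[0-9]+(?:\.[0-9]{1,2})?")


def is_valid_amount(text, pattern=DOLLARS_AND_CENTS):
    """Return True if the whole input, ignoring outer whitespace, matches."""
    return pattern.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for sample in ["5", "5.25", "5.255", "five"]:
        print(sample, is_valid_amount(sample, INTEGER_ONLY), is_valid_amount(sample))
```

Using `fullmatch` rather than `search` matters here: it rejects input with stray characters before or after the number, which a partial match would let through.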

3. Finding an annoying bug

This test was created from a real-world bug I had difficulty resolving. Identifying the root cause requires knowledge of the programming language (in this case PHP) and the nuances of the WordPress API.

The error messages provided were not technically accurate. They referenced the beginning and the end of the calling sequence I was running, but the bug was related to the middle part of the code.

I wasn’t alone in struggling to solve the problem. Three of the other LLMs I tested couldn’t identify the root cause and recommended the more obvious (but wrong) solution of changing the beginning and ending of the calling sequence.

The o1-preview model provided the correct solution. In its explanation, the model also pointed to the WordPress API documentation for the functions I used incorrectly, providing an added resource to learn why it had made its recommendation. Very helpful.

4. Writing a script

This challenge requires the AI to integrate knowledge of three separate coding spheres: the AppleScript language, the Chrome DOM (how a web page is structured internally), and Keyboard Maestro (a specialty programming tool from a single programmer).

Answering this question requires an understanding of all three technologies, as well as how they have to work together.

Once again, o1-preview succeeded, joining only three of the other 10 LLMs that have solved this problem.

A really chatty chatbot

The new reasoning approach for o1-preview certainly doesn’t diminish ChatGPT’s ability to ace our programming tests. The output from my initial WordPress plugin test, in particular, seemed to function as a more sophisticated piece of software than previous versions.

It’s great that ChatGPT provides reasoning steps at the beginning of its work and some explanatory data at the end. However, the explanations can be chatty. I asked o1-preview to write “Hello world” in C#, the canonical test line in programming. This is how GPT-4o responded:

And this is how o1-preview responded to the same test:

I mean, wow, right? That’s a lot of chat from ChatGPT. You can also flip the reasoning dropdown and get even more information:

All of this information is great, but it’s a lot of text to filter through. I prefer a concise explanation, with additional information options in dropdowns removed from the main answer.

Yet ChatGPT’s o1-preview model performed excellently. I look forward to seeing how well it will work when integrated more fully with the GPT-4o features, such as file analysis and web access.

Have you tried coding with o1-preview? What were your experiences? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.