My 8 ChatGPT Agent tests produced only 1 near-perfect result – and a lot of alternative facts

Last week, OpenAI unveiled Agent, its new tool that combines the capabilities of Deep Research and Operator. Operator was OpenAI's first attempt at a computer-using model, a model that actually can open windows and click on user interface elements. ChatGPT Agent can do that and more.

Right now, ChatGPT Agent is only available to $200/mo Pro tier subscribers and provides 400 agent interactions per month. When the $20/mo Plus tier gains access to Agent, which should be today, those users will get 40 interactions per month.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

I upgraded my plan from Plus to Pro just so I could try out the new Agent mode and report back to you. In this article, I'll show you detailed results from eight comprehensive tests.

TL;DR test results

Before we go into the detailed tests, I'll start with some overall TL;DR observations.

Test count: In the past two days, I used 25 of the available 400 queries, for a total of almost 12 hours of hyper-uber-supercomputer use. No wonder this thing costs $200/month.

Nearly every query required a follow-on, so when it comes time for Plus users, don't assume you can give Agent 40 projects. More likely, you'll be giving it 20-25, and using the rest of your queries to convince the Agent to follow directions.
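To make that quota math concrete, here is a minimal back-of-the-envelope sketch. It assumes every project costs one initial query plus some follow-ons; the function name `followup_budget` is hypothetical, and the only inputs are the 40-interaction Plus quota and the 20-25 project range mentioned above.

```python
from typing import Tuple


def followup_budget(quota: int, projects: int) -> Tuple[int, float]:
    """Return (queries left over for follow-ons, average queries per project)."""
    if projects <= 0 or projects > quota:
        raise ValueError("projects must be between 1 and quota")
    spare = quota - projects          # queries not spent on the first prompt
    per_project = quota / projects    # average cost of one project
    return spare, per_project


PLUS_QUOTA = 40  # monthly Agent interactions on the Plus tier

for projects in (20, 25):
    spare, per_project = followup_budget(PLUS_QUOTA, projects)
    print(f"{projects} projects: {spare} follow-ons, {per_project} queries each")
```

In other words, the 20-25 estimate works out to somewhere between 1.6 and 2 queries per project on average.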

Result quality: In all my tests, Agent seemed to understand the problem. But it failed to produce useful results for most of the tests. That said, the final test produced results that can only be characterized as amazingly useful.

Project scale: Agent can't handle big projects, the kind of data analysis projects you really want an AI to be able to handle. It has trouble scrolling through web pages. It can't go to sites that have AI or robots.txt restrictions in place. And long processing exceeds session time allocations, even with the super top-of-the-line gold-pressed latinum Pro edition.

Presentation quality: One of the major pitch points for Agent is its ability to create spreadsheets and presentations. It did okay with spreadsheets, but the graphic quality of the presentations was pretty rough. I expect this to change over time, but don't expect Agent to make presentations you can use without considerable cleanup.

Accuracy: AIs hallucinate. The OpenAI team cautioned about using Agent because of the new risks involved. While I did get back some results that were accurate, Agent also came back with unforced errors, results it could have easily tested and deemed inaccurate. But no such verification or validation occurred. That said, the final test was accurate and shows what this tech can do when it works.

Connectors: Agent comes with the ability to use connectors (via API calls) to link to Gmail, Google Calendar, Google Drive, Outlook, Dropbox, and more. I did not try out the connectors because of how often Agent hallucinates or does something fairly boneheaded. I just didn't feel comfortable enough to give Skynet access to my accounts. At least, not yet.

Limits: I was unable to use Agent in the MacOS app. I also found that Agent stalled hard when I tried to run it in multiple Chrome tabs at once. For now, you launch an Agent process and wait. It's not like Codex, where you can launch a bunch of projects and come back later and harvest all the results. But since that capability exists in Codex, I'm sure it will show up soon in Agent.

That should give you a pretty good overview. Let's get started looking at the eight test results. For each result, I've included a link to the session recording, so you can see the prompts I used, the detailed results, and watch Agent reason its way through the problem.

Also, definitely read to the end. Some of the early results are fairly bad, but the last one knocks it out of the park. And with that, here we go.

1. Selecting products on Amazon

Understanding of the problem: Solid
Execution: Both good and bad
Hallucination: Weird church reference, fake Amazon links
Processing time: 20 + 12 minutes

When OpenAI introduced ChatGPT Agent, the team demoed how they used the tool to shop for wedding clothes and a wedding gift. That seemed like a fairly strange and impractical application for a super-intelligence, especially since gift registries exist and are widely used.

Instead, I gave Agent a shopping project I had actually extensively researched and completed a few months earlier. I'm running Power-over-Ethernet cables all across my yard to upgrade my security system. As such, I'm creating a lot of custom cables. I already know that doing so requires some key tools: a cutter to slice the cable, a cable end stripper, a crimper to attach the RJ-45 ends, and a tester to confirm that long cable runs work.

I gave Agent a prompt asking for three configurations: a budget toolset, a “money-is-no-object” solution, and a sweet spot solution. I asked for links, product descriptions, and product images.

Once you give Agent your prompt, it creates a virtual desktop. You can watch it conducting its actions, jumping between a desktop view, a text view, and code.

The budget solution turned out to be a win. Agent found a single $34 kit with everything I asked for. It provided a link, and even its reasoning for choosing that solution. Unfortunately, the image it provided was nothing like the actual kit.

The mid-tier and top-tier solutions were less than perfect. None of the links worked. The mid-tier sweet spot solution did have a product-accurate image, but without a link, it wasn't really helpful.

Unfortunately, the model recommended doesn't actually exist on Amazon. In fact, none of the mid- or upper-tier products exist on Amazon. It looks like Agent did a pile of web surfing to find the products, disregarding my instructions to search only on Amazon.

It also clearly visited other sites, probably gathering model names and descriptions.

Then, when it packaged up its final recommendations, it just assigned random Amazon links to the descriptions, even though those products and those links don't seem to exist on Amazon.

I did ask it to go back and try again. When it did, after 12 minutes, it provided most of the same products, although one of the links that had failed earlier did, in fact, point to a product on Amazon in the second run.
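This is the kind of unforced error a simple self-check could catch. The following is a minimal sketch, not anything Agent is known to run: it flags links that point off the requested site and links that do not return HTTP 200. The function names, the `link-check-sketch` user agent, and the placeholder URLs in the usage note are all assumptions for illustration. Keep in mind that large retailers often reject automated requests, so a "broken" verdict from a script is a prompt for a manual look, not proof.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable
from urllib.parse import urlparse


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (OSError, ValueError):
        return 0  # DNS failure, timeout, refused connection, or malformed URL


def vet_links(
    links: Iterable[str],
    allowed_host: str = "amazon.com",
    status_of: Callable[[str], int] = http_status,
) -> Dict[str, str]:
    """Label each link as 'ok', 'broken', or 'off-site'."""
    report: Dict[str, str] = {}
    for url in links:
        host = urlparse(url).hostname or ""
        on_site = host == allowed_host or host.endswith("." + allowed_host)
        if not on_site:
            report[url] = "off-site"
        elif status_of(url) == 200:
            report[url] = "ok"
        else:
            report[url] = "broken"
    return report
```

Passing a stand-in for `status_of` lets you try the logic without touching the network, for example `vet_links(["https://www.amazon.com/dp/PLACEHOLDER"], status_of=lambda url: 404)`, which labels that placeholder link as broken.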

I can't leave this section without pointing out something just plain weird. As I was watching Agent work, it presented this in its desktop view. I don't even want to know.

You can watch a replay of the entire session here.

2. Comparing egg prices

Understanding of the problem: Solid
Execution: Did what I asked
Hallucination: My fault for imprecise prompting
Processing time: 14 minutes

In discussing ChatGPT Agent, OpenAI showed a slide that mentioned Instacart as one of the examples that the chatbot is comfortable working with. Since my family regularly uses Instacart, I decided to set Agent loose and see what it could tell me about egg prices at our local stores.

I didn't let Agent have access to my account, but I shared my ZIP code here in Salem, Oregon. I told it to “Please go to all the grocery stores on Instacart and compare egg prices.”

It did exactly that. You've heard the phrase Garbage In, Garbage Out. Well, that's what happens when you ask an AI to look at “all the grocery stores.” I should have asked it to look in a 5 or 10 mile radius only. But I didn't.

Agent came back with 21 stores, ranging from nearby to almost 47 miles away. It did accomplish what I asked, comparing egg prices. Without prompting, it decided to rank the eggs by price. This was good. But when it chose the eggs to rank, it didn't always choose the least expensive product from each store.

For example, it recommended the Good & Gather eggs from Target at $2.99 a dozen, rather than the $1.99/dozen Market Pantry eggs, also from Target.
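The comparison I actually wanted is easy to state in code. Here is a minimal sketch, assuming the listings have already been collected: keep only stores within a radius, pick the cheapest product at each store, then rank. The two Target prices are the ones from this test; the distances and the far-away store are hypothetical placeholders, and `EggListing` and `cheapest_per_store` are names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EggListing:
    store: str
    product: str
    price_per_dozen: float
    miles_away: float


def cheapest_per_store(listings: Iterable[EggListing], max_miles: float) -> List[EggListing]:
    """Keep the lowest-priced listing per store within max_miles, cheapest first."""
    best: Dict[str, EggListing] = {}
    for item in listings:
        if item.miles_away > max_miles:
            continue  # outside the requested radius
        current = best.get(item.store)
        if current is None or item.price_per_dozen < current.price_per_dozen:
            best[item.store] = item
    return sorted(best.values(), key=lambda item: item.price_per_dozen)


listings = [
    # Target prices come from the test; the 3-mile distance is a placeholder.
    EggListing("Target", "Good & Gather", 2.99, 3.0),
    EggListing("Target", "Market Pantry", 1.99, 3.0),
    # Hypothetical store, included only to exercise the radius filter.
    EggListing("Faraway Grocer", "Store brand", 1.49, 47.0),
]

ranked = cheapest_per_store(listings, max_miles=10)
for item in ranked:
    print(f"{item.store}: {item.product} at ${item.price_per_dozen:.2f}/dozen")
```

With a 10-mile limit, the 47-mile store drops out and Target is represented by the Market Pantry eggs rather than the pricier Good & Gather ones.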

You can watch a replay of the entire session here.

3. Creating a PowerPoint slide

Understanding of the problem: Solid
Execution: Added the correct data point
Hallucination: Was unable to reproduce graphic quality
Processing time: 10 minutes

Next up is a project I did early last week. With Congress focusing on Bitcoin, my editor asked me to update my Bitcoin investment article, where I've been tracking the value of a $50 Bitcoin investment since 2022.

The value of my holdings went up, which means I needed to add a new slide. Each slide adds a date value on the X axis and a price point on the Y axis. From a PowerPoint fiddling standpoint, that meant moving over the graphics to make room for the new value and, in this case, adjusting the vertical scale to accommodate a substantial rise in value.

When I did it, it took me about 45 minutes. Since OpenAI said that PowerPoint was one of ChatGPT Agent's strengths, I wanted to see if Agent could save me that time in the future.

I uploaded my existing slide deck minus the last slide I made for the article. Then I asked Agent to create that slide for me.

As it worked, the desktop view showed the terminal interface. You can see how Agent is putting together the code to generate a graphic image.

Here's what that slide should have looked like (note: foreshadowing).

Here's what Agent gave me.

To be fair, Agent clearly understood the problem. It moved the existing data points over to the left to make room for the new node. It also placed the new Bitcoin item properly in relation to the existing ones, and added both price and percentage change text blocks.

That means Agent read and understood the context of my PowerPoint deck's layout. That, in and of itself, is very impressive.

But it failed on adding more scale lines and new Y-axis values. It failed on reproducing the fonts. It failed on properly placing the text blocks. And it pushed the entire graphic up and to the left of the slide.

I'm guessing the graphics library that Agent uses isn't really up to the task of making fine graphic modifications. That will undoubtedly improve over time.

You can watch a replay of the entire session here.

4. Article categorization (method II)

Understanding of the problem: Solid
Execution: Failed due to exceeding allowable session time
Hallucination: Gave me back partial results
Processing time: 8 minutes + 3 minutes + 21 minutes

Each week for the past two years, I've published a newsletter that shares with followers the articles I published here on ZDNET for the week. Each newsletter contains a title, link, and article description.

By pointing Agent to my back issue archive, it would have close to 300 article summaries to categorize.

Unfortunately, Agent ran into a number of problems of its own making. It was unable to successfully scroll through the article list using JavaScript. When I told it to use the web interface, it started to, but it reported, “Unfortunately, I've reached the end of the allotted browsing sessions for this task, which means I can't explore further pages and collect the additional data at this time.”

Remember, I'm paying $200 a month for OpenAI's best plan, and it still won't give me enough time to look through 300 articles. That's a gotcha, right there. It's also disappointing because a task like scrolling back through an article archive and doing some tabulating is exactly the kind of task you might give to an assistant. If the AI gives up because it takes too long, then we can't really rely on AI for all the assistant-type things. No one wants a fussy, picky assistant.

In any case, Agent did give me back a spreadsheet and a slide based on the limited data it was able to find before my little request exceeded the hourly power budget for the City of Las Vegas (or so I imagine).

You can watch a replay of the entire session here.

5. Extract remembered text from video

Understanding of the problem: Partial
Execution: Did not return full transcript on first run, correct on second run
Hallucination: Decided to do what it wanted on first run
Processing time: 2 minutes

I watch a lot of YouTube videos to enhance my learning and research. Plus, nothing beats a good relaxing video about how pavers are made. While it's fairly easy to get a transcript of a full video, whether directly from YouTube or using Apple Voice Memos, locating the exact segment you want within a video can take time.

Here's an example. When OpenAI introduced Agent in a video, CEO Sam Altman discussed some of the cautions and warnings about using ChatGPT Agent mode. I did remember they were near the end of the video, but I didn't want to spend time sifting through to get the exact quotes.

Instead, I delegated that assignment to Agent. On its first run, it found the segment easily enough, but instead of returning a word-for-word transcript, it returned some quotes, interspersed with its own analysis.

I clarified what I wanted and, on its second run, it gave me exactly what I needed. In this case, though, it wasn't that my prompt was unclear. I just had to insist a second time that I wanted a transcript for the AI to do what I asked.

Unfortunately, this extra review cycle diminished the time-saving value to me. I still think using Agent was faster than if I had sifted through the video myself. But I had to construct a second prompt and wait for a second result, all of which took my time.

Still, this is a helpful tool.

You can watch a replay of the entire session here.

6. Creating a trend analysis presentation

Understanding of the problem: Solid
Execution: Good, except for slide visual quality
Hallucination: Too much data to confirm or deny assertions
Processing time: 32 minutes

As part of my job, it's important to be able to keep up with ongoing tech and business trends. As such, I often spend days in deep dives, coming up to speed on new topics.

I wanted to see if ChatGPT Agent could save me some time by preparing a report and a full presentation on remote work trends. I told it that the PowerPoint was destined for my management team, so it should be comprehensive and professional-looking.

It returned an analysis document very similar to the results we've been getting from ChatGPT deep research. The report contains a lot of assertions and statistical claims, most of which I don't have time to research for confirmation.

Most of the top-level conclusions are congruent with my understanding of current work-from-home trends. That said, we're familiar with the model's propensity for hallucination, so I would be very concerned about using any of this data professionally without additional vetting.

Agent did produce a 17-slide PowerPoint deck that was organized pretty well. As with earlier experiments, the graphic generation quality was a bit off. The first slide actually looks pretty good.

But later in the deck, it doesn't look right. Notice how the following slide has graphics on top of text, and bullets in front of bullets on top of empty bullets.

In the following slide, not only is the text running off the end of the page, but there's no legend. As such, it's not clear what's represented by purple and by blue.

Once again, you can see how Python is used to construct the deck.

Agent does a decent job, so I'm fairly confident that the AI will get better over time. Programmatic construction of slides based on templates is not a new technology. I just don't think OpenAI prioritized slide presentation aesthetics as part of this release.

You can watch a replay of the entire session here.

7. Vetting a presentation for accuracy

Understanding of the problem: Solid
Execution: Good
Hallucination: Seems complete, but it's still from an AI
Processing time: 11 minutes + 7 minutes

Well, this was just plain fun. I decided to give the presentation created in the previous test to a fresh ChatGPT Agent session and asked it to validate the claims.

Agent concluded, “Several quantitative claims, especially those concerning productivity/innovation impacts, the size and growth of the gig economy, rates of side-gig participation, and the influence of politics and culture, could not be verified with accessible evidence during this review.”

Agent provided a detailed analysis of each assertion. I've summarized the results below.

Adoption timeline: Largely confirmed
Global comparison: Confirmed
Workforce composition: Confirmed
Migration: Confirmed
Mobility of remote workers: Confirmed
Housing & local economies: Confirmed
Office vacancy & environmental impacts: Largely confirmed
Social connections & wellbeing: Partly confirmed
Employer attitudes & return-to-office mandates: Largely confirmed
Employee preferences & pay cuts: Largely confirmed
Productivity & innovation: Partly confirmed
Gig economy & freelancing: Unverified
Freelancing motivations & challenges: Not strictly factual claims
Side gigs & multiple jobs: Unverified
Demographics & equity: Partly confirmed / mixed
Political & cultural influences: Partly confirmed / mostly unverified
Other factors & policy landscape: Generally accurate but qualitative

As you can see, of the 17 data points, Agent considered only five to be fully confirmed. Contrast this with how GPT-4o analyzed the results. When GPT-4o was given the same PowerPoint deck, it considered all assertions to be confirmed. You can see GPT-4o's detailed results here.
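If you want to check that tally yourself, here is a small sketch that counts the verdicts from the summary list above. The verdict strings are transcribed from that list; treating only an exact "Confirmed" as fully confirmed is my reading of the results, not something Agent reported.

```python
from collections import Counter

# Verdicts transcribed from the summary list of Agent's review.
verdicts = {
    "Adoption timeline": "Largely confirmed",
    "Global comparison": "Confirmed",
    "Workforce composition": "Confirmed",
    "Migration": "Confirmed",
    "Mobility of remote workers": "Confirmed",
    "Housing & local economies": "Confirmed",
    "Office vacancy & environmental impacts": "Largely confirmed",
    "Social connections & wellbeing": "Partly confirmed",
    "Employer attitudes & return-to-office mandates": "Largely confirmed",
    "Employee preferences & pay cuts": "Largely confirmed",
    "Productivity & innovation": "Partly confirmed",
    "Gig economy & freelancing": "Unverified",
    "Freelancing motivations & challenges": "Not strictly factual claims",
    "Side gigs & multiple jobs": "Unverified",
    "Demographics & equity": "Partly confirmed / mixed",
    "Political & cultural influences": "Partly confirmed / mostly unverified",
    "Other factors & policy landscape": "Generally accurate but qualitative",
}

counts = Counter(verdicts.values())
fully_confirmed = counts["Confirmed"]  # exact match only, no qualifiers

print(f"{fully_confirmed} of {len(verdicts)} fully confirmed")
for verdict, count in counts.most_common():
    print(f"{count:2d}  {verdict}")
```

The first line of output reads "5 of 17 fully confirmed", and the breakdown shows four more items that were largely confirmed and two that Agent could not verify at all.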

Even though I used the AI to validate the AI, I probably wouldn't be comfortable using any of the presumed facts in my work without personal, Mark I Eyeball confirmation. Still, it was a fun exercise, and fascinating to see how different the results were between ChatGPT Agent and ChatGPT 4o.

You can watch a replay of the entire session here.

8. Analyze building code for fence installation

Understanding of the problem: Solid
Execution: Pretty close to perfect
Hallucination: None. It got all but one graphic perfect
Processing time: 4 minutes

Back when we lived in Palm Bay, Florida, we lived on a corner property. The house came with what could only charitably be called a fence. We needed to replace it, and since we wanted privacy, we wanted to see just how much fence we could legally install.

Over the course of a couple of years, I spent a ton of time going back and forth with the planning office in an effort to understand both what I could do with a fence and what other alternatives might be available to me.

Since I have a lot of history with this project and am very familiar with Palm Bay codes (even years after moving away), I decided to point ChatGPT Agent at the problem.

It took all of four minutes to produce a detailed, accurate analysis. It even created working diagrams that illustrated the options. Based on my experience, I know the results to be accurate.

ChatGPT Agent produced output that could be used to take this project to the next step. Back when I lived in Palm Bay, the equivalent probably took me 20 calls, a ton of emails, and a few visits to City Hall to come up with options. The level of presentation and organization I came up with wasn't even close.

If Agent can up its game elsewhere to be on a par with this test, then it will have some legs.

You can watch a replay of the entire session here.

What's it all mean?

Well, it sure as heck isn't sentient yet. At best, it's like that administrative assistant you hired because your mom said you had to hire her cousin's unemployable slacker kid. There are occasional flashes of brilliance, but mostly the output seems like the result of both aggressively following directions and purposely inventing alternative facts.

Is it worth $200/month for the Pro program? Not for Agent. At least not yet. Agent is unreliable and generally performs fairly poorly. In a year or so, I'm sure it will get better. But now? No. The only reason to spend $200 a month on it is to do what I'm doing: testing it to see where the technology is today.

Stay tuned, because despite all the inaccuracies and problem areas, this definitely shows where AI technology could go. Of course, if a web browsing AI Agent is the future, and all the content sites out there block it because AI is stealing our content, then we'll have a very interesting problem.

It's early days, folks. Whether this is a technology that will be a boon to all humanity or a technology that destroys the internet and kills us in our sleep remains to be seen.

But hey, in the meantime, I and the rest of the ZDNET team will be trying to make sense of it all for you. So keep coming back. We'll have more to tell you. I'll be tinkering with Agent and I'm sure I'll have more to say as well.

Have you tried ChatGPT Agent yet? If so, did it follow your instructions precisely or veer off into its own interpretation of the task? Did it hallucinate or hit the mark? How do you feel about giving AI tools access to your files, accounts, or browser? Are you seeing more value in this kind of automation, or are you still waiting for it to become useful? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
