In Harvard study, AI offered more accurate emergency room diagnoses than two human doctors


A new study examines how large language models perform in a variety of medical contexts, including real emergency room cases, where at least one model appeared to be more accurate than human doctors.

The study was published this week in Science and comes from a research team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted a variety of experiments to measure how OpenAI's models compared to human physicians.

In one experiment, researchers focused on 76 patients who came into the Beth Israel emergency room, comparing the diagnoses provided by two internal medicine attending physicians to those generated by OpenAI's o1 and 4o models. These diagnoses were assessed by two other attending physicians, who didn't know which ones came from humans and which came from AI.

"At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o," the study said, adding that the differences "were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision."

In Harvard Medical School's press release about the study, the researchers emphasized that they did not "pre-process the data at all": the AI models were provided with the same information that was available in the electronic medical records at the time of each diagnosis.

With that information, the o1 model managed to provide "the exact or very close diagnosis" in 67% of triage cases, compared to one physician who had the exact or close diagnosis 55% of the time, and to the other, who hit the mark 50% of the time.

"We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study's lead authors, in the press release.


To be clear, the study didn't claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, it said the findings show an "urgent need for prospective trials to evaluate these technologies in real-world patient care settings."

The researchers also noted that they only studied how models performed when provided with text-based information, and that "recent studies suggest that current foundation models are more limited in reasoning over nontext inputs."

Adam Rodman, a Beth Israel physician who is also one of the study's lead authors, warned the Guardian that there is "no formal framework right now for accountability" around AI diagnoses, and that patients still "want humans to guide them through life or death decisions [and] to guide them through complicated treatment decisions."

In a post about the study, Kristen Panthagani, an emergency physician, said this is "an interesting AI study that has led to some very overhyped headlines," particularly because it was comparing AI diagnoses to those from internal medicine physicians, not ER physicians.

"If we're going to compare AI tools to physicians' clinical ability, we should start by comparing to physicians who actually practice that specialty," Panthagani said. "I would not be surprised if an LLM could beat a dermatologist at a neurosurgery board exam, [but] that's not a particularly helpful thing to know."

She also argued, "As an ER doctor seeing a patient for the first time, my primary goal is not to guess your final diagnosis. My primary goal is to determine if you have a condition that could kill you."

This post and headline have been updated to reflect the fact that the diagnoses in the study came from internal medicine attending physicians, and to include commentary from Kristen Panthagani.

