
What was published
On April 30, 2026, researchers at Harvard Medical School, Beth Israel Deaconess Medical Center, Stanford University, and other institutions published a study in Science — one of the world's most prestigious peer-reviewed journals — titled "Performance of a large language model on the reasoning tasks of a physician".
The study tested a preview version of OpenAI's o1-series model across six experiments designed to measure the kind of thinking doctors do every day: generating a list of possible diagnoses, choosing the right test to order, estimating how likely a disease is, and deciding on a treatment plan. Across all experiments, the AI generally met or beat the performance of human physicians — sometimes by wide margins.
In the most striking experiment, the AI model correctly identified the exact or near diagnosis in 67% of emergency room triage cases, compared to 55% and 50% for the two attending physicians assessed.
The diagnoses were then assessed by two other attending physicians, who did not know which came from humans and which from the AI. The results favoured the AI. The study's lead author, Arjun Manrai, assistant professor of biomedical informatics at Harvard Medical School, said, "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines."
How the experiment was run
The researchers put the AI model through a series of experiments to evaluate its clinical acumen, including real-life cases such as that of a patient with lupus who had previously visited Beth Israel's emergency department in Boston. The AI scanned the medical records and suspected that a history of lupus, an autoimmune condition that can lead to heart inflammation, could explain what was really ailing the patient. It was correct.
In one experiment, researchers focused on 76 patients who came into the Beth Israel emergency room, comparing the diagnoses offered by two internal medicine attending physicians to those generated by OpenAI's o1 and 4o models.
When challenged with 377 contemporary clinical cases, o3 – OpenAI's most recent reasoning model – ranked the final diagnosis first in 60% of cases and within the top ten in 84% of cases, outperforming a 20-physician baseline. Next-test selection accuracy reached 98%.
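For readers who want to see what "ranked first" and "within the top ten" mean in practice, here is a minimal sketch of how a top-k accuracy figure can be computed. It is an illustration only, not the researchers' scoring method: the function name `top_k_accuracy` is hypothetical, the four cases are invented toy data, and exact text matching stands in for the clinical judgement a real grader would apply.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_lists: Sequence[Sequence[str]],
    final_diagnoses: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose confirmed diagnosis is among the first k guesses."""
    if len(ranked_lists) != len(final_diagnoses):
        raise ValueError("each case needs exactly one confirmed diagnosis")
    if not final_diagnoses:
        raise ValueError("at least one case is required")
    # A case counts as a hit when the confirmed diagnosis appears in the
    # first k entries of that case's ranked list (exact string match, which
    # is a simplifying assumption).
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_lists, final_diagnoses)
    )
    return hits / len(final_diagnoses)


# Hypothetical toy data, invented for illustration (not taken from the study).
differentials = [
    ["lupus", "viral myocarditis", "pericarditis"],
    ["pneumonia", "pulmonary embolism", "heart failure"],
    ["migraine", "tension headache", "sinusitis"],
    ["gastritis", "appendicitis", "gallstones"],
]
confirmed = ["lupus", "pulmonary embolism", "sinusitis", "pancreatitis"]

top1 = top_k_accuracy(differentials, confirmed, k=1)
top3 = top_k_accuracy(differentials, confirmed, k=3)
print(f"top-1: {top1:.0%}, top-3: {top3:.0%}")  # prints "top-1: 25%, top-3: 75%"
```

The gap between the two numbers is the point: a list can contain the right answer without putting it first, which is why results of this kind are usually reported at more than one cut-off.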
Critically, the AI model relied solely on text for its diagnosis — the same electronic health records available to the human doctors, with no preprocessing or additional information. It had no advantage from imaging, physical examination, or any of the other inputs a clinician typically uses. And it still outperformed the physicians. Dr. David Reich, chief clinical officer for Mount Sinai Health System in New York, who was not involved in the research, said, "This paper is a beautiful summary of just how much things have improved."
What the researchers said it does not mean
This is a study that demands to be read carefully — because what it does not say matters as much as what it does.
The study stopped well short of advocating clinical deployment, calling instead for formal prospective trials. The study authors noted that the AI model relied solely on text for diagnosis, while real-life clinicians have to consider other inputs like images, sounds, and nonverbal cues.
One of the paper's co-senior authors, Rodman, said that as generative AI tools like chatbots are heavily marketed, both to patients and clinicians, he worries that the science experiments, all based on simulated and historical cases, will be misconstrued as proof of AI's safety and efficacy when used to treat real patients.
Critics, including emergency physician Kristen Panthagani, cautioned that comparing AI to non-specialist physicians and equating diagnostic guessing with genuine emergency care represented a significant methodological limitation.
The researchers are right to be careful. A diagnosis made from a text record is not the same as the care delivered at a bedside. The doctor who correctly diagnoses lupus also holds the patient's hand, explains what this means for their life, speaks to a frightened family, and makes dozens of judgement calls that no dataset captures. These remain human. For now. And possibly always.
But none of that changes what the paper actually found. The reasoning task, reading available information and arriving at the most likely diagnosis, is something an AI now does better than an experienced physician. That is the finding. Published in Science. Tested across six experiments. Led by Harvard and Stanford.
What this means for the world children are growing up into
Medicine is not the only field where this pattern is emerging. It is simply the field where this week's most important evidence landed. The same trajectory — AI performing expert-level reasoning on complex, high-stakes tasks — is visible in law, in software engineering, in scientific research, and in financial analysis. The Harvard study is not an isolated data point. It is part of a consistent and accelerating pattern.
The question this raises for schools is the one we find ourselves returning to every week: what does it mean to prepare a child for a world where AI can now out-reason experts in some of the most demanding cognitive tasks humans have ever developed?
There are two inadequate answers. The first is to pretend it is not happening — to carry on preparing students for a world of professional expertise as if the Harvard study had not been published. The second is to panic — to conclude that expertise is worthless and human judgement obsolete and to stop teaching rigorous thinking altogether.
Both miss the point.
The Harvard study did not find that AI replaces doctors. It found that AI out-reasons them on a specific, text-based diagnostic task. The physician who correctly identifies lupus from electronic records is performing one part of what medicine actually is. The physician who sits with the patient diagnosed with lupus, explains what the coming years will look like, helps them understand their options, and remains present through a process that is not just informational but deeply human: that physician is irreplaceable. Not because AI cannot process the information. But because presence, empathy, moral responsibility, and genuine care are not information problems.
This distinction is the heart of everything AI Ready School is built around. Cypher does not give students answers — it asks them to think. Not because thinking faster than AI is the goal. It is not. The goal is developing students who understand what they are thinking about deeply enough to direct AI, evaluate its outputs, catch its errors, and take responsibility for the decisions made with its assistance. The doctor who uses AI as a diagnostic partner — who knows enough to trust it when it is right and override it when it is wrong — is more capable than either the doctor alone or the AI alone. That is the practitioner schools must now prepare students to become. Not in medicine only. In every field.
NEO — AI Ready School's hands-on innovation lab — exists to make this concrete. Not as a lesson about AI, but as a practice of working with it. Children who have built with AI, broken it, questioned it, and understood where its reasoning fails are not intimidated by a Harvard study showing AI outperforming doctors. They are curious about it. They ask the right questions. They understand why the researchers were careful with their conclusions. They are, in short, exactly the kind of thinkers a world where AI can now out-reason specialists actually needs.
The sentence from the study worth remembering
The paper's co-senior author described the study as a response to a gauntlet thrown down in Science in 1959 – a paper that described how you would know that a clinical decision support system was capable of making diagnoses better than humans. His conclusion: "They can do it."
Sixty-seven years later, the gauntlet has been answered. The question now — for medicine, for every profession, and for every school deciding what to teach the children who will practise them — is what we do next.