OpenAI reopens rare-disease cases

A NEJM AI study shows how o3 Deep Research helped specialists surface 18 diagnoses across 376 previously unsolved pediatric cases.

On June 18, 2026, OpenAI published the results of a NEJM AI study conducted with Boston Children’s Hospital and Harvard, in which a reasoning model, o3 Deep Research, was used to reanalyze 376 pediatric rare-disease cases that had remained unsolved after genetic testing and specialist review. After human review, additional testing, and clinical confirmation, physicians established 18 diagnoses, an added diagnostic yield of 4.8% (18 of 376 cases).
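The 4.8% figure is simple arithmetic: confirmed diagnoses divided by cases reanalyzed. A minimal sketch of that calculation follows. The function name is illustrative and does not come from the study.

```python
def added_diagnostic_yield(new_diagnoses: int, cases_reanalyzed: int) -> float:
    """Return the added diagnostic yield as a percentage.

    Illustrative helper (hypothetical name): confirmed new diagnoses
    divided by the number of previously unsolved cases reanalyzed.
    """
    if cases_reanalyzed <= 0:
        raise ValueError("cases_reanalyzed must be positive")
    return 100.0 * new_diagnoses / cases_reanalyzed


# Figures reported in the article: 18 diagnoses across 376 cases.
yield_pct = added_diagnostic_yield(18, 376)
print(f"{yield_pct:.1f}%")  # prints "4.8%"
```

Only diagnoses that survived clinical confirmation sit in the numerator, so the figure reflects confirmed results, not the number of hypotheses the model proposed.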

The important point is not that AI “diagnosed” patients instead of doctors. The study says the opposite. The model received de-identified data, standardized clinical descriptions, and filtered genetic-variant tables. It was asked to produce evidence-linked hypotheses connecting symptoms, inheritance patterns, possible variants, and scientific literature. Those leads were then reviewed by at least two specialists, assessed under standard genetic-classification frameworks, and confirmed in a clinical laboratory when the evidence was strong enough. In other words, AI acted as a synthesis and triage layer, not as a medical authority.
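The review steps described above can be read as a series of gates that a model-generated lead must pass before it counts as a diagnosis. The sketch below is one way to picture that gating and is not the study's actual pipeline. The `Lead` fields, the stage names, and the classification labels are all assumptions made for illustration, since the article says only that leads were assessed under "standard genetic-classification frameworks".

```python
from dataclasses import dataclass

# Assumed labels for "strong enough" evidence; the article does not name them.
STRONG_CLASSIFICATIONS = {"pathogenic", "likely pathogenic"}
MIN_REVIEWERS = 2  # the article says "at least two specialists"


@dataclass
class Lead:
    """A model-generated hypothesis for one de-identified case (hypothetical shape)."""
    case_id: str
    hypothesis: str
    reviewer_approvals: int = 0
    classification: str = "uncertain significance"
    lab_confirmed: bool = False


def stage_reached(lead: Lead) -> str:
    """Return how far a lead has progressed through the human review gates."""
    if lead.reviewer_approvals < MIN_REVIEWERS:
        return "needs specialist review"
    if lead.classification not in STRONG_CLASSIFICATIONS:
        return "insufficient evidence"
    if not lead.lab_confirmed:
        return "awaiting lab confirmation"
    return "confirmed diagnosis"


# Made-up example leads, not data from the study.
leads = [
    Lead("case-001", "variant in gene A explains phenotype", 2, "likely pathogenic", True),
    Lead("case-002", "variant in gene B explains phenotype", 2, "uncertain significance"),
    Lead("case-003", "variant in gene C explains phenotype", 1, "pathogenic"),
]

confirmed = [lead for lead in leads if stage_reached(lead) == "confirmed diagnosis"]
print(len(confirmed))  # prints 1
```

In this picture the model only populates the `hypothesis` field. Every other field is filled in by people or by a clinical laboratory, which is the sense in which it acts as a triage layer.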

The percentage is modest, but the underlying problem is real. In rare disease, a patient’s genome does not change while medical knowledge does: a gene becomes associated with a condition, a variant is reclassified, or a new paper adds a comparable case. Periodically revisiting old unsolved files requires time, documentary memory, and the ability to connect records that often live in separate formats and databases. The study suggests that a reasoning model can help experts prioritize leads they may not have the bandwidth to explore systematically, especially when cases have already passed through several pipelines without an answer.

The work also gives a more realistic picture of medical AI than claims of instant diagnosis. The cases were de-identified, the model’s proposals had to be justified, and a result only counted after validation by qualified clinicians. Some leads involved connecting scattered clinical signs to an already documented variant. Others generated a biologically coherent hypothesis that still required follow-up. In each case, the model widened the search space rather than replacing judgment.

Caution remains the core lesson. The authors did not measure time saved, cost, clinician workload, or the burden of false positives, and the work was retrospective. A model can still produce a plausible explanation that fails under review, which is why every candidate diagnosis went through full clinical adjudication. The sober takeaway is therefore narrower than the headline might imply: what AI provides here is not automatic diagnosis but an assisted reanalysis workflow that may help clinical teams revisit hard cases as scientific knowledge moves forward.