Diagnosis is a classification task: a patient has symptoms, the symptoms map to a list of possible conditions, and the clinician selects the best label. This is an appealingly simple way to describe how medicine works, and in many cases it's good enough. In others, it is quite inadequate.

Diagnosis can be straightforward – when the patient presents with typical signs and symptoms, risk is low, and one explanation is much more likely than others. In these simple cases, pattern recognition works well for both clinicians and the AI systems we’re about to discuss, and care is delivered efficiently.
Other times, diagnosis is much more of a process that unfolds over time. Information arrives in bits and pieces, symptoms evolve, tests identify possibilities but also add uncertainty. Patients return for re-evaluation, and the condition changes. In these more complex situations, what’s important is not only achieving a correct final diagnosis, but the timing of it, how it’s reached, and its impact on subsequent decisions like the choice of treatment.
Consider a patient presenting with headache and feeling slightly unwell. At the first visit, there are no red flags and initial tests are normal. A benign explanation is reasonable. Twenty-four hours later, the patient returns. The headache is worse and is accompanied by neck stiffness and vomiting. A reasonable working diagnosis of tension headache is urgently revised to meningitis.
Diagnosis is often provisional and meant to be revised
This process-based understanding is reflected in formal definitions of diagnostic error. The US National Academy of Medicine defines it as a failure to establish an accurate and timely explanation of a patient’s health problem, or to communicate that explanation effectively. Timing, reasoning, and communication are part of the process.

Image: Conceptual model of the diagnostic process from the 2015 report “Improving Diagnosis in Health Care” published by the US National Academies of Sciences, Engineering, and Medicine.
Diagnostic error is common and consequential
Diagnostic error is estimated to affect about 5% of outpatient encounters and up to 15% of hospital cases. It accounts for a disproportionately large share of preventable harm. In the United States alone, around 795 000 people are estimated to die or be permanently disabled each year as a result of misdiagnosis. Errors cluster in cases where presentations are atypical, information is ambiguous, or conditions evolve.
AI and diagnosis: beyond static comparisons
Much of the public discussion about artificial intelligence in diagnosis still relies on the simplified model we started with. Claims that "AI outperforms doctors" are usually based on neat but static case descriptions or exam-style questions in which all relevant information is available upfront. These comparisons play to AI's strength in pattern recognition but capture only a slice of how diagnosis works in practice.
This is important because reducing diagnostic error is a safety and quality priority and AI is increasingly positioned as part of the solution. To understand the potential, and the risks, we need to examine how diagnostic decisions are made under uncertainty.
How clinicians think
Decades of research in cognitive psychology show that expert decision-makers rely heavily on heuristics – mental shortcuts that enable rapid judgment in complex environments. These heuristics support efficient and skilled practice, particularly in routine cases, but occasionally, they misfire.
Availability bias, confirmation bias, anchoring, and framing effects are well-described features of how clinical reasoning can go wrong. Studies suggest that these cognitive biases contribute to roughly three-quarters of serious diagnostic errors. They are not signs of poor training or lack of care, but by-products of how humans manage uncertainty, time pressure and complexity.
How machines behave
As generative AI (specifically large language models) is considered for diagnostic support, important questions arise: how do these systems behave under uncertainty, and how do their failure modes differ from our own?
An emerging interdisciplinary field referred to as machine psychology has begun to address this. Machine psychology does not imply that AI systems think or feel. Instead, it treats them as complex decision-making artefacts whose behaviour can be studied empirically, using methods adapted from cognitive science.

Machine psychologists examine how AI outputs change when inputs change. They find the same diagnostic scenario can yield different conclusions if an irrelevant contextual detail is added, information is presented in a different order, or a suggestion is embedded early in the prompt. The clinical facts are unchanged but the system’s reasoning and recommendations shift in predictable ways.
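A minimal sketch of this kind of perturbation test is shown below. It is illustrative only: the vignette wording, the distractor, and the query_model helper are hypothetical placeholders rather than any published protocol. The idea is simply to run the same clinical facts with and without an irrelevant detail and compare the distribution of answers across repeated samples.

```python
from collections import Counter

# Illustrative sketch only: "query_model" stands in for whatever large language
# model is being evaluated; it is a hypothetical placeholder, not a real API.
def query_model(vignette: str) -> str:
    # In a real study this would call the model and extract its top diagnosis.
    return "subarachnoid haemorrhage"  # canned answer so the sketch runs end to end

BASE_VIGNETTE = (
    "55-year-old with sudden severe headache, neck stiffness and vomiting. "
    "What is the single most likely diagnosis? Answer with one condition."
)
# A salient but clinically irrelevant detail, used as a distractor.
DISTRACTOR = "The patient returned from a holiday abroad last month. "

def answer_distribution(vignette: str, n_samples: int = 20) -> Counter:
    """Sample the model repeatedly and count how often each diagnosis appears."""
    return Counter(query_model(vignette) for _ in range(n_samples))

if __name__ == "__main__":
    baseline = answer_distribution(BASE_VIGNETTE)
    perturbed = answer_distribution(DISTRACTOR + BASE_VIGNETTE)
    # If the added detail is truly irrelevant, the two distributions should match;
    # a systematic shift is the kind of effect machine psychology looks for.
    print("Without distractor:", baseline.most_common(3))
    print("With distractor:   ", perturbed.most_common(3))
```

The same harness can be reused for order effects or embedded suggestions by changing only the perturbation, which keeps the clinical facts constant across conditions.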
Shared patterns, different origins
In diagnostic tasks, large language models often show patterns that resemble human cognition. Their outputs can also be influenced by salient but clinically irrelevant details. For example, mentioning recent foreign travel, highlighting a family history, or introducing an early triage label can tilt the AI toward certain explanations even when the key symptoms are unchanged. Once an initial hypothesis is introduced, the model may elaborate on it while giving less attention to viable alternatives.
Empirical studies comparing GPT-4 with medical residents show that AI accuracy declines when cases include misleading contextual information. At the same time, AI appears less vulnerable to biases that depend on human memory, emotion, or recent experience.
These differences are important when AI is used as a second opinion. Research shows that users are more likely to accept AI recommendations that align with their initial judgment and to discount those that challenge it. In practice, a clinician may consult an AI tool “just to check”. Agreement boosts confidence; disagreement is often ignored.
When AI suggestions are wrong, clinicians may still follow them because of automation bias – the tendency to uncritically accept machine outputs over one’s own professional judgment. In experimental settings, this has led to diagnostic accuracy falling below baseline performance. Explanations of AI recommendations help, but on their own have not been sufficient to prevent over-reliance on the AI diagnosis.
Error patterns at scale
When an individual professional makes an error, the impact is usually local. When an algorithm embeds a flawed association and is widely deployed, the same error can propagate across organisations, regions, or entire health systems.
This has important implications for safety and governance. It means that performance averages can be misleading. How a system fails, and whether those failures are isolated or systematic, is as important as how often it is correct.
Once these patterns are recognised, AI behaviour can be adjusted. Structured prompts (e.g., asking the AI to “list five potential causes for these symptoms in order of severity”), forcing step-by-step reasoning, requiring consideration of alternatives, or having the model critique its own conclusions can reduce anchoring and premature closure. Using these methods, AI may be more amenable to systematic “debiasing” than human decision-makers.
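As an illustration only, such a structured prompt might be assembled along the lines of the template below. The wording and the build_structured_prompt name are hypothetical, not a validated instrument; the point is that the prompt explicitly demands a differential, arguments against each option, a must-not-miss condition, and a self-critique.

```python
def build_structured_prompt(case_summary: str) -> str:
    """Assemble a prompt that forces a differential, alternatives and self-critique."""
    return (
        f"Case: {case_summary}\n\n"
        "1. List five potential causes for these symptoms in order of severity.\n"
        "2. For each cause, note the findings that support it and those that argue against it.\n"
        "3. Name at least one serious condition that must not be missed, even if unlikely.\n"
        "4. State your working diagnosis, then critique it: what new information would make you revise it?\n"
    )

print(build_structured_prompt(
    "Adult with worsening headache over 24 hours, now with neck stiffness and vomiting."
))
```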
Diagnosis as a collaborative activity
AI is here, and the future of diagnosis is likely to be collaborative. Humans will provide context, judgment and ethical grounding. AI systems will tirelessly and consistently[1] retain and recall vast amounts of medical knowledge, never becoming distracted or bored.
In a healthy collaboration, AI tools will support exploration rather than provide definitive answers. Their value will be in expanding the diagnostic space, identifying alternatives, highlighting uncertainty and suggesting reconsideration. Structured reflection and diagnostic checklists are known to reduce cognitive error in humans, and similar principles can be applied to AI-supported workflows.
Clinicians will engage with AI outputs as inputs to reasoning rather than endpoints. Evidence shows that AI alone can outperform clinicians on diagnostic vignettes, yet simply giving clinicians access to AI does not reliably improve performance, and can sometimes make it worse. The limiting factor is not AI capability, but how human and machine decision-making interact.
Conclusion
Many diagnoses are straightforward, and pattern recognition – human or machine – serves us well in those cases. The greatest safety challenges lie at the margins: where information is ambiguous, conditions evolve, or assumptions no longer hold. These cases account for a disproportionate share of preventable harm and are where understanding decision-making matters most.
AI will bring new forms of support, but also introduce new types of error. In high-stakes settings like clinical decision making, studying its behaviour is essential. If diagnostic error is a leading source of preventable harm, then improving diagnosis requires understanding how decisions are made by humans, by machines, and by the combination.
Cognitive science, and now machine psychology, provide a framework for doing this and will continue to offer useful insights.

Readings
Bergl P, et al. Diagnostic Error in the Critically Ill: A Hidden Epidemic? Crit Care Clin. 2022 Jan;38(1):11-25. https://www.criticalcare.theclinics.com/article/S0749-0704(21)00074-9/abstract
Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003 Aug;78(8):775-80. https://journals.lww.com/academicmedicine/fulltext/2003/08000/the_importance_of_cognitive_errors_in.12.as
Graber ML, Franklin N, Gordon R. Diagnostic error in internal medicine. Arch Intern Med. 2005;165(13):1493-1499. https://doi.org/10.1001/archinte.165.13.1493
Hagendorff T, et al. Machine psychology. arXiv. https://arxiv.org/abs/2303.13988
van den Berge K, Mamede S. Cognitive diagnostic error in internal medicine. Eur J Intern Med. 2013 Sep;24(6):525-529.
Kucking F. Automation Bias in AI-Decision Support: Results from an Empirical Study. Stud Health Technol Inform. 317:298-304. doi:10.3233/SHTI240871.
Mahajan A. Cognitive bias in clinical large language models. npj Digital Medicine. 2025;8:428. https://www.nature.com/articles/s41746-025-01790-0
National Academies of Sciences, Engineering, and Medicine. Improving Diagnosis in Health Care. 2015. https://www.nationalacademies.org/read/21794
Newman-Toker DE, et al. Burden of serious harms from diagnostic error in the USA. BMJ Qual Saf. 2024;33:109-120. https://qualitysafety.bmj.com/content/33/2/109.info
NEJM Grand Rounds: Zwaan L. From Hindsight Bias to Machine Bias: Dr. Laura Zwaan on Learning from Mistakes. https://ai-podcast.nejm.org/e/from-hindsight-bias-to-machine-bias-dr-laura-zwaan-on-learning-from-mistakes/
Singh H, et al. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Qual Saf. 2014 Sep;23(9):727-31. https://qualitysafety.bmj.com/content/23/9/727.long
Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019. https://www.nature.com/articles/s41591-018-0300-7
Wei J, et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv. 2022. https://arxiv.org/abs/2201.11903
Zwaan L, et al. Diagnostic error in hospitals: finding forests not just the big trees. BMJ Qual Saf. 2020;29:961-964. https://qualitysafety.bmj.com/content/29/12/961.info
[1] At the level of individual outputs, large language models are probabilistic rather than deterministic. Their behaviour is nevertheless consistent in a statistical sense: similar inputs reliably produce similar distributions of outputs and recurring patterns of reasoning and error. It is this behavioural consistency, rather than identical responses, that matters for safety and governance at scale.