Blame the world, not the model

Five ideas from a new paper

In an earlier article, I wrote about artificial intelligence (AI) world models: systems built to represent how the world behaves and changes over time, rather than working only via language and other static data. An April 2026 paper (preprint) takes that idea into medicine. Its main argument is that clinical AI often disappoints because it has not been designed around an explicit model of the clinical world in which it operates.

A sepsis prediction model can perform well where it was built but miss most sepsis cases when tested elsewhere. A radiology model can look impressive in development and then fail in a new hospital because it learned shortcuts rather than clinical signals. Large language models can pass medical exams yet show important limitations when assessed for factuality, reasoning, harm and bias.

Safavi-Naini and colleagues offer a structured explanation for why this keeps happening. Their paper is dense and ambitious. But for a general healthcare audience, five ideas are enough.

1. Clinical care is a world, not a dataset

Most AI systems are trained on data snapshots: images, notes, lab values or curated records. Clinical care is something different. It is a changing system made up of a patient, a provider, and an ecosystem of workflows, devices, records, routines and resources. The paper formalises this as a three-part model of Patient, Provider and Ecosystem. Care emerges through interaction among the three.

This sounds abstract until you apply it to an ordinary situation. Chest pain in an emergency department is not the same clinical problem as chest pain in a primary care clinic or at home. The risks, available actions, timelines and decision thresholds are different. A model that has only learned from one slice of that world may struggle when moved to another.

The “ecosystem” part is especially important. Hidden causes of AI failure are often found there: a different scanner, brand of laboratory equipment, staffing pattern, documentation workflow or threshold for escalating care. The paper is helpful because it turns important background conditions into part of the model rather than treating them as noise.

Clinical care emerges from interaction between patient, provider, and ecosystem. The ecosystem includes data, tools, workflows and local conditions that often determine whether an AI system succeeds or fails. Image: Gemini/Nano Banana

2. Performance in one place tells you little about another

This is the paper’s second important idea, possibly the one with the most immediate practical value. Clinical AI is often validated in one version of the clinical world and deployed in another. Published performance can disappoint outside the development setting. The paper argues that this is not just a technical problem of generalisation. It is also a problem of specification. We do not define clearly enough where, for whom, and under what conditions a system is meant to work.

Real-world examples fit this pattern. In an external validation study by Wong and colleagues, the Epic Sepsis Model showed poor discrimination and calibration outside its development setting. In chest radiograph AI for COVID-19, apparently accurate systems failed in new hospitals because they relied on confounding features rather than pathology. Kelly and colleagues argued several years ago that generalisability, workflow fit and external validation are the main barriers to real clinical impact.

The implication is that “Does it work?” is usually the wrong question. A better question: for which condition, in which setting, at what stage of care, for which user, and for what task?

3. A shared specification grammar

The paper is trying to give clinicians, developers and regulators a common grammar for specifying what an AI system is supposed to do and what evidence would be relevant. The framework is meant to help clinical AI be “specified, evaluated, and bounded across stakeholders.” This is the easiest part of the paper to miss and probably its most original contribution.

The mechanism is what the authors call the Clinical AI Skill-Mix. It says a clinical AI use case should be defined along multiple dimensions. Five describe the clinical scenario: condition, phase of care, setting, provider role and task. Three describe how AI engages with human reasoning: who the agent is facing, where it enters the cognitive process, and how much authority it is assigned. In the paper, the combinatorial result is a “competency space” running into the billions of possible coordinates. Validation in one cell provides only limited evidence for another.
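To make the idea of a coordinate concrete, here is a minimal Python sketch. It assumes each of the eight dimensions can be written as a simple label. The class name, field names and example values are hypothetical illustrations, not the paper's own taxonomy. The helper lists the dimensions on which a validated use case and a proposed deployment differ.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SkillMixCoordinate:
    """One cell of the competency space (field names are illustrative)."""
    # Five dimensions describing the clinical scenario
    condition: str
    phase_of_care: str
    setting: str
    provider_role: str
    task: str
    # Three dimensions describing how the AI engages with human reasoning
    facing: str        # who the agent is facing
    entry_point: str   # where it enters the cognitive process
    authority: str     # how much authority it is assigned


def mismatched_dimensions(validated, deployed):
    """Return the names of the dimensions on which two coordinates differ."""
    return [
        f.name
        for f in fields(SkillMixCoordinate)
        if getattr(validated, f.name) != getattr(deployed, f.name)
    ]


# Hypothetical example values, invented for illustration only.
validated = SkillMixCoordinate(
    condition="sepsis",
    phase_of_care="diagnosis",
    setting="inpatient ward",
    provider_role="nurse",
    task="risk prediction",
    facing="provider",
    entry_point="detection",
    authority="monitoring",
)

# Same system, moved to a different setting and a different user.
deployed = replace(
    validated, setting="emergency department", provider_role="physician"
)

print(mismatched_dimensions(validated, deployed))
# ['setting', 'provider_role']
```

An empty list would mean the deployment sits in the same cell as the evidence. Any other result names the dimensions where new evidence would be needed.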

This means more than “context matters.” It suggests that an AI model’s “license to practice” should be narrow, well-defined and earned one use case at a time.

4. AI should be described based on where it enters clinical reasoning and how much authority it gets

The paper distinguishes between authority modes and anchoring layers in decision-making. In the framework the authority modes are monitoring, augmentation and automation, and each is tied to where the AI enters the clinical cognitive process.

An AI system that highlights abnormal lab results is doing something different from one that generates a differential diagnosis or one that triggers an action. Different authority levels may be warranted in different tasks. The paper argues that this should be declared because it affects accountability, user expectations, interface design and how errors propagate through the system.
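One way to picture a declared authority mode is as a gate on what a system may do. The Python sketch below assumes the three modes can be treated as ordered levels, which is a simplification made here rather than a claim from the paper. The output types and their required levels are hypothetical.

```python
from enum import IntEnum


class AuthorityMode(IntEnum):
    MONITORING = 1    # surfaces information, e.g. flags an abnormal lab result
    AUGMENTATION = 2  # proposes content for a clinician to accept or reject
    AUTOMATION = 3    # triggers an action itself


# Hypothetical mapping from output type to the minimum authority it requires.
REQUIRED_AUTHORITY = {
    "flag_abnormal_lab": AuthorityMode.MONITORING,
    "suggest_differential": AuthorityMode.AUGMENTATION,
    "trigger_action": AuthorityMode.AUTOMATION,
}


def is_permitted(output_type, declared):
    """True if the declared authority mode covers the requested output type."""
    return declared >= REQUIRED_AUTHORITY[output_type]


declared = AuthorityMode.AUGMENTATION
print(is_permitted("suggest_differential", declared))  # True
print(is_permitted("trigger_action", declared))        # False
```

Declaring the mode up front makes the boundary explicit: a system cleared for augmentation can suggest but cannot act on its own.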

This is also where many current discussions about “AI doctors” are unhelpful. The relevant question is not whether AI is intelligent or competent in the abstract. It is what position the system occupies in the reasoning pathway, what role it is asked to play, and whether the evidence matches the claim.

Clinical care is a decision cycle rather than a simple linear pathway. AI can enter at different points in that cycle and can do so with different levels of authority: monitoring, augmentation/assistance, or automation/action. Image: Gemini / Nano Banana

5. Time matters

Clinical reasoning does not happen in isolated snapshots. It develops during repeated cycles of interaction over seconds, hours, days or months. Patients experience illness continuously while clinicians sample it only intermittently. Between encounters, symptoms evolve, adherence changes and complications can occur without being observed. Conventional care might be described as periodic observation separated by long periods of clinical blindness.

This is important because some AI systems can help fill those temporal gaps. An AI agent, unlike a clinician, need not be episodic. It can in principle monitor continuously, update its representation of the patient’s state, and detect divergence from the expected trajectory between visits. Whether that is useful or safe depends, again, on how the task is specified and what authority the system is given.
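Detecting divergence from an expected trajectory can be sketched very simply. The Python example below assumes the expected course is available as a list of values at fixed intervals and that a single fixed tolerance is acceptable. Both are simplifications, and the heart-rate numbers are invented for illustration.

```python
def first_divergence(expected, observed, tolerance):
    """Return the index of the first reading outside tolerance, or None."""
    for i, (e, o) in enumerate(zip(expected, observed)):
        if abs(o - e) > tolerance:
            return i
    return None


# Invented readings: an expected recovery course versus what was observed.
expected_hr = [88, 86, 84, 82, 80, 78]
observed_hr = [89, 87, 86, 90, 96, 101]

print(first_divergence(expected_hr, observed_hr, tolerance=5))  # 3
```

What happens once a divergence is found is a separate question. Whether the system only records it, alerts a clinician or acts on it depends on the authority it has been given.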

Why this is important

This paper does not “solve” clinical AI. But it improves the conversation by asking whether the AI and its context have been specified in a clinically meaningful way.

This is a useful approach for clinicians because it encourages narrower and more realistic claims. It is valuable for developers because it makes vague claims harder to hide behind. It is valuable for regulators and health systems because it moves evaluation to real use cases rather than “model performance”. The approach also fits with calls to move clinical AI from benchmarks to workflow-embedded, real-world evaluation.

The world-model idea still applies. In clinical care, the “world” is not just physiology. It includes people, interactions, infrastructure, authority and time. If the world is poorly specified, even an impressive model can fail. If it is well specified, the chances of useful and trustworthy clinical AI improve.

References

Safavi-Naini SAA, Meftah E, Mohess J, et al. Grounding Clinical AI Competency in Human Cognition Through the Clinical World Model and Skill-Mix Framework. arXiv, 2026. https://arxiv.org/abs/2604.08226

Wong A, Otles E, Donnelly JP, et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model. JAMA Internal Medicine. 2021. https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2781307

DeGrave AJ, Janizek JD, Lee S-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nature Machine Intelligence. 2021. https://www.nature.com/articles/s42256-021-00338-7

Kelly CJ, Karthikesalingam A, Suleyman M, et al. Key challenges for delivering clinical impact with artificial intelligence. BMC Medicine. 2019. https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-019-1426-2

Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023. https://www.nature.com/articles/s41586-023-06291-2

Azad TD, Krumholz HM, Saria S. Principles to guide clinical AI readiness and move from benchmarks to real-world evaluation. Nature Medicine. 2026. https://pubmed.ncbi.nlm.nih.gov/41578031/

The translation of layered AI architecture to a distributed clinical world model of patients, clinicians and systems. Image: Gemini / Nano Banana
