Large language models (LLMs) are now routinely used to write clinical notes, discharge summaries, referral letters, patient instructions and insurance correspondence. Their appeal is obvious: they are fast, fluent and usually accurate. As the documentation burden increases and clinical time shrinks, AI-generated text is becoming embedded in everyday clinical workflows.
The dominant safety response has been to insist on “human in the loop” oversight: clinicians are expected to review AI-generated text, catch errors and intervene when something looks wrong.
This sounds reassuring, but in practice it is a fragile and misleading safety model.

Paul Klee’s Limits of Reason. The contrast between ordered and complex elements evokes the challenge of making sense of intricate narratives. Source: https://www.sammlung.pinakothek.de/en/artwork/Y0GROr84RX, Public Domain, Wikipedia.
Why “human in the loop” fails
LLMs rarely produce text that is obviously incorrect. Errors are generally subtle: they tend to involve timing, attribution, clinical rationale, or plausible-but-wrong interpretations of events, embedded in otherwise accurate, coherent prose.
For example, a discharge summary may state that a patient was intubated (placed on a breathing machine) because their condition suddenly worsened, when the record shows the patient was intubated briefly for a planned procedure and recovered soon afterwards. Another note may attribute worsening kidney function to a diagnostic scan, when the timeline shows the problem began earlier. Individually, these statements sound reasonable. Detecting the error requires reconstructing events from multiple notes, timestamps and authors – difficult to do reliably during routine review.
Clinical work, however, happens in time-pressured environments characterised by interruptions, multitasking, handovers and competing priorities. Reviewing AI documentation becomes one more task added to an already heavy cognitive load.
Expecting clinicians to reliably detect every subtle factual error in AI-generated notes ignores the practical limits of attention and judgement, and assumes a level of vigilance that cannot be sustained consistently.
This is not a criticism of clinicians but a recognition of human limits.

Rubin’s Vase. An ambiguous image that supports two interpretations depending on how the viewer’s perception organizes figure and ground. Like clinical text, context and inference shape what we see. Source: Nevit Dilmen, CC BY-SA 3.0, via Wikimedia Commons.
Even experts don’t consistently agree
These limits are demonstrated in a recent study which examined how well clinicians can verify factual accuracy in clinical summaries. The participants were asked to judge whether individual statements in discharge summaries were supported, not supported, or not addressed by the patient’s EHR. Each statement was reviewed by multiple clinicians, with access to the full record.
Agreement was high for clearly supported statements but dropped substantially for ambiguous or unsupported ones. Even under controlled conditions, the highest inter-rater agreement was 88.5%, reflecting the inherent ambiguity of the task rather than lack of expertise or diligence.
In other words, even experienced clinicians, given time and data, do not reliably agree on factual correctness when errors are subtle or context-dependent.
If experts can’t consistently agree in a study setting, expecting individual clinicians to function as reliable real-time error detectors in live clinical workflows is unrealistic.
The “needle in a haystack” problem
LLM-generated clinical documents typically contain many correct statements and a small number of incorrect ones.
This creates a needle-in-a-haystack problem: the clinician is asked to locate a small number of factual errors hidden inside a large volume of plausible text.
Manual review is poorly suited to this kind of task. Humans are good at synthesis, interpretation and judgement. They are much less reliable at repetitive, detailed verification under time pressure.
Relying on clinician vigilance as the primary safety mechanism is therefore not good system design. It places responsibility at the “sharp end” of care rather than embedding safety upstream.
Responsibility without control is a design failure
Current workflows put clinicians in an uncomfortable position. AI systems generate content and clinicians are expected to review and sign it, taking legal and professional responsibility even when errors originate upstream and are difficult to detect.
From a safety perspective, this is a red flag. High-risk industries don’t rely on individual vigilance as their main defence. They design systems that make errors visible, measurable and easier to catch.
Safety by design
A useful distinction from the safety literature is between safety by design and safety add-ons. Writing about disasters, including the Boeing 737 Max, Andrew Hopkins argues that truly safe systems are designed to eliminate hazards at the outset, rather than relying on layers of warnings, monitoring, and human intervention after the fact.
A simple analogy: if the risk of a house burning down in a wildfire is considered only after construction, safety depends on add-ons: sprinkler systems, water tanks, electric pumps and backup generators. If fire risk is considered during design, much of the hazard can be removed by using non-flammable materials and appropriate layouts.
Many current AI workflows resemble the first approach. Core systems are deployed with known risks, and safety is added later through monitoring dashboards, alerts, human review steps, and escalation pathways. These measures function like fire alarms and sprinklers: compensating controls layered on a design that may be fundamentally unsafe.
This is understandable in a young and rapidly evolving field. But it may also reflect a deeper pattern: adopting systems that are not inherently safe, and then investing heavily in add-ons to manage the consequences. Moving beyond “human in the loop” requires a shift toward inherently safer AI design, where verification, constraints and error visibility are built into the system itself.
From “human in the loop” to “expert in the loop”
It’s more appropriate to say “expert in the loop” than “human in the loop”.
Detecting subtle clinical inaccuracies requires domain expertise. But expertise is most effective when applied selectively – to reviewing patterns, trends and exceptions rather than manually inspecting every individual output. Experts should be used to improve system behaviour over time, identify recurring failure modes, and refine safeguards.
Learning from other high-risk industries
In other complex, high-risk domains – software engineering, aviation and manufacturing, for example – safety does not depend on constant manual inspection. Systems are made observable.
“Observability platforms” collect structured records of system behaviour, often called traces. These traces show the inputs a system has received, how it processed them and what outputs were produced. Engineers do not review every output. Instead, they monitor behaviour across large numbers of interactions. Expert attention is triggered by signals: anomalies, drift, increased error rates or unexpected patterns.
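To make this concrete, here is a minimal sketch of what a trace record and a signal-driven review trigger might look like. The field names, the error-rate threshold and the needs_expert_review function are illustrative assumptions, not the schema or API of any particular observability platform.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative trace record. The fields and threshold below are assumptions
# for the sake of the example, not the schema of any specific platform.
@dataclass
class Trace:
    trace_id: str
    timestamp: datetime
    document_type: str        # e.g. discharge summary, referral letter
    total_claims: int         # propositions extracted from the generated note
    unsupported_claims: int   # propositions flagged by automated verification

def needs_expert_review(recent_traces: list[Trace], threshold: float = 0.02) -> bool:
    """Trigger expert attention when the unsupported-claim rate across recent
    outputs drifts above a threshold, rather than reviewing every output."""
    total = sum(t.total_claims for t in recent_traces)
    unsupported = sum(t.unsupported_claims for t in recent_traces)
    return total > 0 and unsupported / total > threshold
```

The point is not the specific numbers but the shift in where attention goes: experts are pulled in by aggregate signals, not by every individual note.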
Healthcare AI systems have, until recently, lacked this kind of observability.
Automated fact verification as system-level safety
The VeriFact system, described in NEJM AI, is an example of how this gap can be addressed and demonstrates that factual verification of AI-generated clinical text can itself be automated.
The system works by:
- Breaking an AI-generated clinical narrative into individual propositions.
- Retrieving the relevant facts from the patient’s EHR.
- Using a separate language model to judge whether each proposition is supported, not supported, or not addressed.
In effect, VeriFact functions as a “spellchecker for clinical facts”. It does not rewrite the note but verifies whether statements are grounded in documented patient data.
This transforms unstructured narrative into a structured record of AI behaviour – something that can be reviewed, audited and learned from.
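As a rough illustration of this decompose–retrieve–judge pattern, the sketch below uses naive stand-ins (sentence splitting, keyword overlap, a trivial judging rule) where the published system uses language models and proper retrieval over the EHR; the function names and logic are assumptions for illustration only.

```python
from dataclasses import dataclass

LABELS = ("supported", "not_supported", "not_addressed")

@dataclass
class Verdict:
    proposition: str
    label: str          # one of LABELS
    evidence: list[str]

def decompose(note_text: str) -> list[str]:
    """Stub: split a narrative into atomic propositions.
    In practice this step would be far more sophisticated than sentence splitting."""
    return [s.strip() for s in note_text.split(".") if s.strip()]

def retrieve_evidence(proposition: str, ehr_facts: list[str]) -> list[str]:
    """Stub: keyword overlap stands in for retrieval over the patient's record."""
    words = set(proposition.lower().split())
    return [fact for fact in ehr_facts if words & set(fact.lower().split())]

def judge(proposition: str, evidence: list[str]) -> str:
    """Stub: in a real pipeline a separate LLM call returns one of LABELS."""
    return "supported" if evidence else "not_addressed"

def verify_note(note_text: str, ehr_facts: list[str]) -> list[Verdict]:
    verdicts = []
    for proposition in decompose(note_text):
        evidence = retrieve_evidence(proposition, ehr_facts)
        verdicts.append(Verdict(proposition, judge(proposition, evidence), evidence))
    return verdicts
```

The useful property is the output shape: each proposition carries a label and the evidence behind it, which is what makes the note reviewable and auditable.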
Performance compared with human review
VeriFact achieved 93.2% agreement with clinician consensus when verifying factual alignment with the EHR, exceeding the agreement achieved by individual clinicians performing the same task.
Other findings from the study reinforce key points:
- Clinicians collectively spent over 1,600 hours annotating just over 13,000 statements, illustrating the effort required for manual verification.
- Agreement was highest for clearly supported statements and lowest for ambiguous or negative ones.
- Larger models performed better than smaller models.
- Models with explicit reasoning capability performed better still.
These findings suggest that accurate fact verification in medicine depends on both scale and reasoning, something humans struggle to sustain continuously but machines can apply consistently.
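For readers who want to see what the agreement figures measure, here is a minimal sketch of percent agreement against a majority-vote clinician consensus. It is a simplification: the study’s exact consensus definition and any chance-corrected statistics may differ.

```python
from collections import Counter

def consensus(labels: list[str]) -> str:
    """Majority vote across clinician annotations for one statement."""
    return Counter(labels).most_common(1)[0][0]

def percent_agreement(model_labels: list[str], clinician_labels: list[list[str]]) -> float:
    """Fraction of statements where the model's label matches the clinician consensus."""
    matches = sum(
        model == consensus(humans)
        for model, humans in zip(model_labels, clinician_labels)
    )
    return matches / len(model_labels)

# Toy example: three statements, each labelled by three clinicians.
humans = [
    ["supported", "supported", "supported"],
    ["not_supported", "not_supported", "not_addressed"],
    ["not_addressed", "supported", "not_addressed"],
]
model = ["supported", "not_supported", "supported"]
print(percent_agreement(model, humans))  # 2 of 3 statements agree in this toy example
```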

Trust Me (1860–62), by John Everett Millais, an early example of the “problem picture” – artworks with deliberately ambiguous scenes that invite multiple interpretations, echoing the interpretive challenge of validating complex clinical narratives. Source: Wikipedia (Creative Commons)
Implications for clinical workflow
The goal of automated verification is to change how human expertise is used.
Instead of manually checking every statement, clinicians can:
- Focus on flagged inconsistencies (see the triage sketch after this list).
- Review contested or ambiguous facts.
- Apply judgement where nuance, ethics or context matter.
- Spend more time on synthesis and decision-making.
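In practice, “focusing on flagged inconsistencies” can be a simple triage step over the verifier’s output. The sketch below assumes the Verdict records from the earlier pipeline sketch and is illustrative only.

```python
# Illustrative triage over verification output: surface only the propositions
# the verifier could not ground in the EHR, so expert attention goes to exceptions.
REVIEW_LABELS = {"not_supported", "not_addressed"}

def flag_for_review(verdicts):
    """Return only the propositions a clinician needs to look at, riskiest first."""
    flagged = [v for v in verdicts if v.label in REVIEW_LABELS]
    # "not_supported" (contradicted by the record) is riskier than
    # "not_addressed" (merely unverifiable), so it sorts first.
    return sorted(flagged, key=lambda v: v.label != "not_supported")
```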
Understanding the limits of automation
Automated verification is not a panacea, however.
VeriFact verifies only what is stated. It does not detect errors of omission, such as missing diagnoses or incomplete summaries.
It treats the EHR as the source of truth. If the record contains copy-and-paste errors, outdated diagnoses or misattributions, the system will faithfully reinforce them.
This reflects the fact that no verification system can exceed the quality of the data it relies on. Automated fact checking supports accuracy within existing records. It does not replace clinical judgement or the need for good documentation practices.
Matching oversight to clinical risk
Not all AI-generated content carries the same level of risk.
Some domains – medication lists, allergies, diagnoses, procedures, legal documentation – require very high factual accuracy. Small errors can have serious consequences.
Other outputs, such as patient-friendly summaries or administrative correspondence, carry lower risk.
Automated verification allows safeguards to be applied proportionately. High-risk content can be verified systematically. Expert attention can be focused where stakes are highest.
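One way to express proportionate safeguards is as explicit configuration mapping document categories to a verification policy. The categories and settings below are illustrative assumptions, not published guidance or any vendor’s configuration.

```python
# Illustrative risk-tiering: which outputs get systematic verification and
# mandatory clinician sign-off. Categories and settings are assumptions only.
VERIFICATION_POLICY = {
    # High-risk content: verify every proposition, require clinician sign-off.
    "medication_list":          {"verify_all": True,  "require_signoff": True},
    "allergy_record":           {"verify_all": True,  "require_signoff": True},
    "discharge_summary":        {"verify_all": True,  "require_signoff": True},
    # Lower-risk content: sample-based verification, spot-check sign-off.
    "patient_friendly_summary": {"verify_all": False, "require_signoff": False},
    "admin_correspondence":     {"verify_all": False, "require_signoff": False},
}

def policy_for(document_type: str) -> dict:
    """Default to the strictest policy when the document type is unknown."""
    return VERIFICATION_POLICY.get(
        document_type, {"verify_all": True, "require_signoff": True}
    )
```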
Shared responsibility by design
As AI-generated documentation becomes routine, responsibility for safety will have to be shared.
Healthcare organisations are responsible for implementing systems that support safe practice.
Technology vendors are responsible for building tools with safeguards that match clinical risk.
Regulators and accrediting bodies are beginning to expect evidence that AI systems are monitored, measured and improved over time.
Where technical solutions exist to verify factual accuracy, their use becomes part of good system design rather than an optional enhancement.
The way forward
The question is how to design systems that use both humans and machines effectively.
Moving from “human in the loop” to “expert in the loop”, supported by automated verification and observability, is a path forward. It recognises the strengths and limits of both clinicians and AI systems. It builds safety into system structures rather than relying on expectations of human vigilance.
Clinicians retain oversight. They intervene when judgement is required but are no longer asked to do the impossible.
Reference
Chung P, Swaminathan A, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI. 2025;3(1):e2500418. doi:10.1056/AIdbp2500418.