Health Systems Action

Evaluating China’s push into autonomous Clinical AI

China is running enormous experiments with Artificial Intelligence (AI) in the clinical encounter. But how well do these systems actually perform? Are they safe for patients? I look for evidence and discuss what countries like South Africa should take from China’s rapid healthcare AI rollout.

Why China is Pursuing Autonomous AI

China’s push toward autonomous AI in healthcare is driven by structural pressures. The country has roughly 17% of the world’s population but proportionately far fewer medical resources, along with deep urban-rural disparities and overwhelming demand at primary-care level. National policy frameworks such as Healthy China 2030 promote digital tools, including AI-driven interviewing and triage, as key to expanding access. In this context it’s not surprising that AI models are being brought into roles normally performed by clinicians.

Bold Claims

Innovative systems such as Ping An’s interview bots, JD Health’s digital twins, and Tsinghua University’s Agent Hospital are reported as operating at scale. The remarkable numbers posted include up to 4 million daily patient interactions handled by virtual doctors, with “99% triage accuracy”, “95% diagnostic accuracy” and “coverage of 37,000 diseases”; Agent Hospital is said to have “treated” tens of thousands of patients, although these are simulated cases. Almost all of these claims come from company records, internal evaluations or media coverage rather than independent peer-reviewed studies, so their reliability is uncertain.

Autonomous versus Supervised AI

It’s helpful to distinguish autonomous AI – systems that interview patients, assign triage categories, or recommend treatment without real-time clinician oversight – from supervised AI tools that support clinicians while leaving final decisions in human hands. The safety profiles differ substantially. Autonomous systems can generate plausible-sounding but incorrect information (hallucinations), and without a clinician in the loop those errors may not be intercepted. Many patients may be unaware that they are interacting with an AI system or that it has important limitations.

Published Evidence

A quick review of published, peer-reviewed studies suggests that most high-quality evidence from China relates to supervised systems, where clinicians retain responsibility and the AI acts as a support tool rather than an independent decision-maker.

In paediatric ophthalmology, a multi-centre randomised trial of CC‑Cruiser for childhood cataract showed that the AI could identify cases with good accuracy, though still below that of senior clinicians (87% vs 99%). Another tool, EE-Explorer, used for triage of emergency eye problems, demonstrated excellent discrimination (AUC >0.98) and performed better than triage nurses at identifying urgent cases. Final diagnosis and management remained under specialist control.

A retrospective study of the Lingyi Zhihui chest-pain diagnosis model in more than 11,000 patients found high sensitivity (91%) but much lower specificity (70%). This pattern suggests possible usefulness as an early-warning tool, but the high false-positive rate would make it unsuited to autonomous triage without clinician review.
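To see why 70% specificity matters so much, it helps to turn the percentages into patient counts. The sketch below is only an illustration: the sensitivity and specificity are the study figures quoted above, but the 10% prevalence and the function name `triage_counts` are my own assumptions, not values from the study.

```python
def triage_counts(sensitivity, specificity, prevalence, n_patients=1000):
    """Expected outcomes when a screening tool is applied to n_patients."""
    with_condition = n_patients * prevalence
    without_condition = n_patients - with_condition

    true_pos = sensitivity * with_condition      # real cases correctly flagged
    false_neg = with_condition - true_pos        # real cases missed
    true_neg = specificity * without_condition   # healthy patients correctly cleared
    false_pos = without_condition - true_neg     # healthy patients wrongly flagged

    # Positive predictive value: share of flagged patients who are real cases.
    ppv = true_pos / (true_pos + false_pos)
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        "false_pos": false_pos,
        "ppv": ppv,
    }


# 91% sensitivity and 70% specificity are the reported figures;
# the 10% prevalence is a hypothetical value chosen for illustration.
result = triage_counts(sensitivity=0.91, specificity=0.70, prevalence=0.10)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under that assumed prevalence, about 270 of every 1,000 patients would be flagged wrongly and only about one flag in four would be a true case. A different prevalence changes the numbers, but not the basic point that someone has to review the flags.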

The AMTES (AI-Powered Medical History-Taking Training and Evaluation System), built on DeepSeek-V2.5, showed 98–99% response accuracy and more than 99% contextual appropriateness when used by medical students practicing history-taking with simulated patients. These results indicate strong performance as an educational tool, but they don’t demonstrate safety or effectiveness in real clinical encounters.

A notable inpatient example is an AI-supported venous-thromboembolism (VTE) decision‑support system at Ruijin Hospital that reduced hospital-associated VTE by 46% in a large randomised trial. This represents a genuine patient-level outcome improvement. Importantly, the AI acted as a prompt and risk-assessment tool; clinicians still made the final prophylaxis decisions.
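A 46% reduction is a relative figure, so its practical size depends on how common hospital-associated VTE was to begin with. The short sketch below shows the conversion. The baseline risks are hypothetical values I have chosen for illustration (the trial’s own baseline is not quoted here), and `absolute_effect` is an illustrative helper, not anything from the study.

```python
def absolute_effect(baseline_risk, relative_reduction):
    """Convert a relative risk reduction into absolute terms."""
    treated_risk = baseline_risk * (1 - relative_reduction)
    absolute_reduction = baseline_risk - treated_risk
    # Patients who must be covered by the system to avoid one event.
    number_needed = 1 / absolute_reduction
    return treated_risk, absolute_reduction, number_needed


# Hypothetical baseline risks; only the 46% reduction comes from the trial report.
for baseline in (0.005, 0.01, 0.02):
    treated, arr, nnt = absolute_effect(baseline, relative_reduction=0.46)
    print(
        f"baseline {baseline:.1%} -> with AI support {treated:.2%}, "
        f"absolute reduction {arr:.2%}, about {nnt:.0f} patients per event avoided"
    )
```

With an assumed 1% baseline, for example, the same 46% relative reduction works out to 0.46 percentage points, or roughly one event avoided per 217 patients.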

Agent Hospital: a Training and Testing Environment

Tsinghua University’s “Agent Hospital” is a large-scale simulation environment where AI agents play the roles of nurses, doctors, and patients interacting throughout the clinical workflow (triage, registration, consultation, treatment, follow-up). Although its agents reportedly score about 93% on benchmark exam questions such as MedQA, the platform trains and tests AI behaviours in simulation only. It provides no evidence that AI agents can safely manage real patients.

Mental‑health AI

In mental health, tools such as Emohaa, a chatbot using cognitive-behavioural techniques, have shown improvements in depression, negative affect and insomnia in a randomised controlled trial. By contrast, Xiaoice, an emotional-support companion with a reported 660 million users, illustrates massive consumer engagement. Engagement is not clinical evidence, however, and Xiaoice is not regulated as a medical device.

In Summary, Evidence Falls Short

In the areas of autonomous interviewing, diagnostic reasoning and triage, prospective validation is limited. Promotional claims dominate and independently verified clinical studies are rare. Although Chinese-language sources offer insight into innovation, deployment, and policy, major clinical trials intended to influence global practice are typically published in English, making it unlikely that a large body of rigorous evidence is being overlooked.

Implications

Large-scale deployment can’t substitute for independent, transparent evaluation. As elsewhere, China’s strongest evidence is in supervised, clinician-in-the-loop systems. The safety and effectiveness of autonomous AI clinicians are unproven, and premature adoption carries risks. Countries such as South Africa should prioritise supervised AI tools that are locally validated and embedded in accountable clinical workflows. China is a global leader in healthcare AI deployment, so we should watch developments there closely, learn from both progress and missteps, and adapt wisely for our context.

……….

Endnotes & Links

Ping An Health’s AskBob system reportedly manages primary care for more than 400 million registered users. The Xin Yi AI Doctor performs autonomous interviews and triage. Up to 4 million consultation requests are said to be handled daily. AI integration is reported to have cut the average service cost per family-doctor user by 52% year-on-year. A triage accuracy rate of over 99% is claimed.

JD Health’s AI Jingyi platform uses the “Digital Twin” idea to model specialist reasoning. The AI Diagnosis Assistant 2.0 is reported to have increased Electronic Medical Record (EMR) writing efficiency by 120% and to achieve a triage accuracy rate of 99.5%.

Tsinghua’s Agent Hospital is a simulated medical environment in which AI Doctor Agents treat AI Patient Agents. Its purpose is to train AI agents, not human clinicians. The system can simulate treating 10,000 patients in a few days. Agents achieved 93.1% accuracy on the respiratory subset of the MedQA dataset (USMLE questions). A public pilot is scheduled for Beijing Tsinghua Changgung Hospital in 2025.

At Ruijin Hospital in Shanghai, a prospective randomized trial involving nearly 20,000 hospitalized patients tested an AI Clinical Assistant for Venous Thromboembolism (VTE) prophylaxis. The intervention group saw a 46% decrease in the incidence of hospital-associated VTE and the mechanical prophylaxis rate increased by 24%.

AI has also performed well in high-volume triage situations. For example, a retrospective study at Peking University Third Hospital tested the Lingyi Zhihui chest pain triage system using data from 11,428 patients. The AI model achieved higher sensitivity than human experts (91% vs 79.6%) but lower specificity (70.2% vs 89.3%).

A recent prospective evaluation of DeepSeek-V2.5 also demonstrated high accuracy in structured history-taking in controlled conditions, though this remains far from fully autonomous clinical interviewing.

The EE-Explorer system for eye emergency triage outperformed triage nurses by a statistically significant margin (p < 0.001) and was highly accurate (AUC of 0.99) in external validation.

The CC-Cruiser trial for diagnosing childhood cataracts makes it one of the few Chinese AI systems tested in a multicentre randomised design. The AI performed well but still lagged behind senior clinicians (87% vs 99% accuracy), illustrating that even in well-defined tasks AI functions best as an assistive tool rather than a replacement.

An evaluation of Ping An’s AskBob system for gastroesophageal cancer at the National University Hospital (NUH) of Singapore found a 96% concordance rate with the decisions of a multidisciplinary oncology board. However, this was an observational concordance study rather than a prospective trial, the system was not operating autonomously, and it covered only one AskBob module, so it is not a general validation of AskBob.

A Randomized Controlled Trial (RCT) assessed the effectiveness of Emohaa, a CBT (Cognitive Behavioural Therapy) chatbot, which was shown to deliver significant improvement in symptoms of depression, negative affect and insomnia compared to control groups.

Xiaoice, originally developed by Microsoft, is an emotional companion that serves an incredible 660 million users and is known for its ability to analyze emotional cues and build pseudo-relationships with users over time. However, Xiaoice is not a medical device and is not clinically validated, and its usage numbers refer to all social and consumer interactions. It is a consumer emotional-companion bot, not a clinical tool.

Generative AI models notoriously “hallucinate,” producing fluent but incorrect medical advice. Domain-tuned models like HuatuoGPT use real-world doctor-patient data to reduce these kinds of errors, but even commercial press releases note that non-trivial error rates persist.
