Intraoperative EEG-guided depth of anaesthesia monitoring may reduce postoperative complications such as delirium, but the evidence is mixed. This uncertainty is a useful lens for seeing how clinical evidence is produced, evaluated and trusted.
A brief report
We examined the association between intraoperative EEG-based depth of anaesthesia (DoA) guidance and postoperative delirium in a retrospective cohort of patients aged ≥65 years undergoing major non-cardiac surgery under general anaesthesia at a tertiary academic centre. Patients were classified according to use of EEG guidance. The primary outcome was postoperative delirium within 72 hours of surgery, identified through structured chart review of documented Confusion Assessment Method (CAM) assessments and clinician notes describing acute changes in attention or cognition. Multivariable logistic regression was used to adjust for age, sex, ASA physical status, frailty score, emergency surgery, case duration and intraoperative opioid dose.
A total of 1,187 patients were included, of whom 463 (39.0%) received EEG-guided anaesthesia (Table 1).
Postoperative delirium occurred in 9.7% of the EEG-guided group and 14.6% of the standard care group. After adjustment, EEG guidance was associated with lower odds of postoperative delirium (adjusted OR 0.65, 95% CI 0.47–0.90; Table 2). Advanced age, frailty, emergency surgery, longer case duration and higher opioid exposure were also associated with increased delirium risk.
Overall model discrimination was acceptable (c-statistic 0.71).
Table 1. Baseline characteristics by intraoperative EEG use
| Characteristic | EEG-guided (n = 463) | Standard care (n = 724) | p value |
| --- | --- | --- | --- |
| Age, mean (SD), years | 73.6 (6.2) | 74.2 (6.5) | 0.11 |
| Age ≥75 years, n (%) | 191 (41.3) | 334 (46.1) | 0.10 |
| Female sex, n (%) | 241 (52.1) | 372 (51.4) | 0.82 |
| ASA physical status III–IV, n (%) | 319 (68.9) | 521 (72.0) | 0.26 |
| Frailty score ≥4, n (%) | 139 (30.0) | 255 (35.2) | 0.07 |
| Emergency surgery, n (%) | 69 (14.9) | 137 (18.9) | 0.08 |
| Case duration, median (IQR), min | 168 (122–218) | 172 (128–224) | 0.29 |
| Intraoperative opioid dose, median (IQR), MME | 43 (31–57) | 46 (33–61) | 0.06 |
ASA = American Society of Anesthesiologists physical status; MME = morphine milligram equivalents.
Table 2. Multivariable logistic regression for postoperative delirium within 72 hours
| Variable | Adjusted OR | 95% CI | p value |
| --- | --- | --- | --- |
| EEG-guided anaesthesia | 0.65 | 0.47–0.90 | 0.009 |
| Age (per year increase) | 1.04 | 1.02–1.06 | <0.001 |
| Female sex | 0.94 | 0.70–1.27 | 0.69 |
| ASA III–IV | 1.36 | 1.01–1.84 | 0.04 |
| Frailty score ≥4 | 1.69 | 1.25–2.30 | <0.001 |
| Emergency surgery | 1.56 | 1.10–2.22 | 0.01 |
| Case duration (per 60 min) | 1.11 | 1.04–1.19 | 0.002 |
| Opioid dose (per 10 MME) | 1.05 | 1.01–1.09 | 0.02 |
A note on the analysis
Before proceeding, the reader is invited to spend a few moments evaluating the description of the methods and the result tables, and to consider how they would ordinarily be interpreted when encountered in the literature:
- What might one infer about allocation to EEG vs standard care?
- Are the covariates sufficient for confounding control?
- Do any distributions look unusual?
- Is treatment assignment likely to be independent of baseline risk?
- Is delirium ascertainment likely to differ by monitoring status?

René Magritte – The Treachery of Images (1929). The pipe is not what it appears to be.
This study did not happen.
The cohort, event rates, and regression results were fabricated in a single pass by a large language model.
No dataset exists; the sample sizes, covariate distributions, event counts, and regression outputs were generated synthetically to resemble a typical retrospective analysis, with familiar risk factors behaving in expected directions, and a modest, statistically significant association consistent with existing literature.
The report demonstrates how easily familiar signals of credibility can be manufactured. The large language model produces a coherent scientific artefact that passes plausibility checks because it conforms to patterns that readers and reviewers are used to trusting.
The fabrication is not just a creative exercise; it echoes the warning sounded in a recent NEJM AI editorial by Isaac Kohane. Kohane describes constructing a deliberately fabricated dataset and analysis built around a simple, intuitively appealing result: a scatterplot suggesting a strong association between an academic’s h-index and the number of retracted papers attributed to them. The analysis appeared statistically “clean”, with a low P value and a coherent narrative, but was entirely fictional.
When the fabricated dataset was subjected to standard anomaly detection, it was initially flagged. Kohane then asked a generative AI model to produce a revised version designed to evade detection. The model complied. The modified dataset again produced a statistically significant association, but this time passed both automated screening and expert review.
The significance of these examples is not so much that generative models are brilliantly deceptive, but that they can rapidly optimise outputs to align with the heuristics and signals that reviewers and detection tools are trained to trust, including internal consistency, plausible variance and conventional effect sizes.
The problem of scientific fraud
Empirical studies suggest that approximately 1–2% of scientists admit to having fabricated or falsified data at least once, while non-self-report analyses indicate substantially higher rates of problematic research behaviours, sometimes in the 20–30% range. These figures do not imply that a similar proportion of published studies are fraudulent, but they indicate that data integrity problems are not rare and are incompletely detected by current review systems.
Anaesthesiology contributions on both sides
Anaesthesiology has unfortunately not been immune to this phenomenon. Some of the most consequential cases of research misconduct have occurred in this discipline and have infiltrated practice guidelines, meta-analyses and patient care.
In response, John Carlisle developed statistical screening methods based on internal consistency of baseline variables and outcome distributions. These methods revealed patterns suggestive of fabrication that had escaped conventional review. Such methods are probabilistic and are effective against poorly constructed or repetitive fabrication, but are not designed to detect artefacts optimised for plausibility such as the one presented here, which survives further testing (see Appendix).
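To give a concrete flavour of this style of screening, below is a minimal Python sketch of a simplified baseline-consistency check. It illustrates the general idea rather than Carlisle's published procedure: under genuine randomisation, p-values for baseline differences should be roughly uniform, whereas fabricated baselines are often too balanced or too discordant. The SDs for duration and opioid dose are invented for the example, since the brief report gives only medians and IQRs, and with just three variables the test is illustrative only.

```python
# Simplified Carlisle-style baseline check (an illustration of the general
# idea, not Carlisle's published method). P-values for baseline differences
# should look roughly uniform on [0, 1] under honest randomisation or
# unbiased sampling; the method gains power across many variables or trials.
from scipy import stats

# (name, mean_a, sd_a, n_a, mean_b, sd_b, n_b) per baseline variable.
# Age comes from the brief report; the SDs for the skewed variables
# (duration, opioid dose) are assumed for this example.
baseline_vars = [
    ("age (years)", 73.6, 6.2, 463, 74.2, 6.5, 724),
    ("duration (min)", 168.0, 70.0, 463, 172.0, 71.0, 724),
    ("opioid (MME)", 43.0, 19.0, 463, 46.0, 21.0, 724),
]

pvals = []
for name, m1, s1, n1, m2, s2, n2 in baseline_vars:
    res = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2)
    pvals.append(res.pvalue)
    print(f"{name}: p = {res.pvalue:.3f}")  # age reproduces the reported 0.11

# Compare the collection of baseline p-values with the uniform distribution
# expected under genuine random sampling.
ks = stats.kstest(pvals, "uniform")
print(f"KS test against uniform: p = {ks.pvalue:.3f}")
```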
A contrasting use of generative AI
Fortunately, the same capacity that allows models to generate plausible artefacts can also be used to strengthen scientific practice.
A recent publication
This open-access study was published in an established, indexed journal with a measurable impact factor and standard peer-review processes.
Liu Y, Fan H, Wei G, et al. The Impact of Electroencephalogram-Guided General Anesthesia on Postoperative Delirium and Recovery in Patients Over 60 with Hip Fractures: A Retrospective Cohort Study. Neuropsychiatr Dis Treat. 2025;21:2895–2910.
https://doi.org/10.2147/NDT.S563641
The paper describes a retrospective observational study evaluating EEG-guided general anaesthesia and postoperative outcomes in older patients with hip fracture.
My initial concern was methodological: EEG-derived measures are reported for both groups, but the Methods do not clearly state whether EEG was recorded in all patients (intervention and control) or how monitoring data were handled in the control arm if not actively used.
In this case, the large language model was used in a different role – to evaluate the entire paper for internal consistency, statistical rigour and methodology.
Findings from the AI-assisted review
The study’s flow diagram (Figure 1) labels the initial cohort “pediatric patients underwent hip fracture surgery”, while the title, abstract and inclusion criteria state that the study population consists of patients over 60 years of age. This suggests template recycling: the authors likely copied the flowchart structure from another paper and failed to overwrite the text, and the manuscript may have been “assembled” rather than written from scratch. At the very least, it reflects poor proofreading.

The Methods section states that 1:1 nearest-neighbour propensity score matching was used. By definition, 1:1 matching produces equal numbers of subjects in each group, yet the post-matching sample sizes are reported as 118 patients in the routine-care group and 105 in the EEG-guided group. Either the matching procedure differed from what was reported, or the reported numbers are inconsistent with the stated method, as the sketch below illustrates.
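The structural point is easy to demonstrate. The following minimal Python sketch performs greedy 1:1 nearest-neighbour matching without replacement on simulated propensity scores; the scores, caliper and group sizes are invented for illustration and are not the authors' procedure. Whatever the inputs, the matched groups come out the same size by construction.

```python
# Greedy 1:1 nearest-neighbour propensity matching without replacement
# (a minimal sketch on simulated data). Every retained treated subject is
# paired with exactly one control, so matched group sizes are always equal.
import numpy as np

rng = np.random.default_rng(0)
treated = rng.uniform(0.2, 0.8, size=118)   # simulated scores, EEG-guided
controls = rng.uniform(0.1, 0.7, size=129)  # simulated scores, routine care

available = list(range(len(controls)))
pairs = []
caliper = 0.05  # maximum allowed score difference for a match (assumed)

for i, score in enumerate(treated):
    if not available:
        break
    j = min(available, key=lambda k: abs(controls[k] - score))
    if abs(controls[j] - score) <= caliper:
        pairs.append((i, j))
        available.remove(j)  # without replacement: each control used once

n_treated_matched = len(pairs)
n_control_matched = len({j for _, j in pairs})
print(n_treated_matched, n_control_matched)  # equal under 1:1 matching
```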
The paper reports an initial cohort of 247 patients, comprising 129 routine-care and 118 EEG-guided cases, but the enrolment, exclusion and matching counts do not reconcile. The text states that 11 patients failed to match, yet the reduction from 247 to 223 (118 routine-care and 105 EEG-guided) means 24 patients were eliminated, more than twice the stated number. These discrepancies suggest either substantial reporting error or post hoc reconstruction of the cohort numbers.
Table 1 reports a mean subject age of approximately 70 years, with a standard deviation (SD) of about 2.5 years. In a normal distribution, the interval within ±1 SD of the mean contains about 68% of the population, so 68% of these hip fracture patients would be between 67.5 and 72.5 years old, a degree of homogeneity highly atypical for a real-world geriatric hip fracture cohort, in which the age spread is usually at least twice as wide.
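The arithmetic is easy to verify. The quick check below assumes a normal age distribution and uses an SD of 5 years as the comparison value for a more typical cohort (an assumption for illustration, not a figure from the paper):

```python
# How concentrated is N(70, 2.5) compared with a more typical spread?
from scipy.stats import norm

for sd in (2.5, 5.0):
    frac = norm.cdf(72.5, loc=70, scale=sd) - norm.cdf(67.5, loc=70, scale=sd)
    print(f"SD = {sd}: {frac:.1%} of patients aged 67.5-72.5")
# SD = 2.5 -> 68.3%; SD = 5.0 -> 38.3%
```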
In addition, female patients were substantially in the minority (34.5%), reversing what is typically seen in hip fracture populations; globally, hip fractures are 2–3 times more common in women.

Finally, the manuscript contains linguistic and editorial errors, including repeated sentences (e.g. consecutive restatements of the study design), phrasing more typical of a protocol than of a completed analysis (“collect patient information…”), and apparent editorial placeholders (“Figure 1 incorporate into the process”). Individually, such mistakes are easy to overlook; together, they suggest limited editorial control over the final text.
Interestingly, the large language model identified these errors and inconsistencies, but did not flag the EEG issue until explicitly directed. As noted, the Methods section does not indicate that EEG monitoring was applied to all patients. If EEG was not recorded universally, the provenance of EEG-derived measures in the routine-care group is unclear.
The contrast is instructive: language models can miss domain-specific methodological ambiguities unless the prompt brings them to the fore.

Image: Gemini
Implications
AI can create highly competent fabrications that evade conventional statistical screening. At the same time, the current peer review system fails to catch implausible elements in real papers.
Together, the two cases show how AI tools can either undermine or support scientific integrity, depending on how they are used.
Unfortunately, the effort required to generate artefacts that “look” like science has collapsed far faster than the effort required to verify them.
As Kohane has argued, generative tools do not create a new vulnerability so much as expose an existing one: a system that uses coherence, familiarity and plausibility as proxies for truth.
Kohane rejects the idea that technical fixes alone can solve this problem. He argues that the core issue is cultural rather than technological: if publication is the primary currency of academic advancement, AI will amplify those existing incentives; if replication, transparency and verification are valued instead, the same tools can strengthen scientific practice.
The practical implication is not to abandon generative AI tools, but to deploy them deliberately in roles that strengthen, rather than erode, scientific and societal goals.
A clarification
None of the anomalies in the Liu paper, taken in isolation, proves fabrication. Each could plausibly arise from careless reporting, misunderstanding of statistical methods, or editorial oversight. What is striking is their co-occurrence in a manuscript that otherwise conforms to expectations. We cannot verify the intent behind these anomalies, but the resulting data lack the integrity needed to support the study’s conclusions. Whether born of fabrication or negligence, the impact on the scientific record is identical.
Readings
- Carlisle JB. The analysis of 168 randomised controlled trials to test data integrity. Anaesthesia. 2012;67(5):521–537. doi:10.1111/j.1365-2044.2012.07128.x.
- Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944–952. doi:10.1111/anae.13938.
- Fanelli D. How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE. 2009;4(5):e5738. doi:10.1371/journal.pone.0005738.
- Nair S, Yean C, Yoo J, et al. Reasons for article retraction in anesthesiology: a comprehensive analysis. Can J Anesth. 2020;67:57–63. doi:10.1007/s12630-019-01508-3.
- McHugh UM, Yentis SM. An analysis of retractions of papers authored by Scott Reuben, Joachim Boldt and Yoshitaka Fujii. Anaesthesia. 2019;74(1):17–21. doi:10.1111/anae.14414.
- Kohane I. The H-index of suspicion: how culture, incentives, and AI challenge scientific integrity. NEJM AI. 2025;3(1). doi:10.1056/AIe2501273.
- Zhang C, Feng J, Wang S, et al. Incidence of and trends in hip fracture among adults in urban China: a nationwide retrospective cohort study. PLoS Med. 2020;17(8):e1003180. doi:10.1371/journal.pmed.1003180.
Appendix: Evaluating the brief report
A large language model (Gemini 3.0) was given the prompt: “Critically evaluate this brief study report. Is it plausible? Specifically look for any anomalies in the reported data in Tables 1 and 2.” The output is reproduced below:
“Based on a critical evaluation of the data provided in the text and tables, the study report is plausible.
There are no significant anomalies in the reported data. The numbers are internally consistent, the statistical logic holds up under scrutiny, and the adjustment from univariate to multivariable analysis follows the expected direction given the baseline characteristics.
Here is the detailed breakdown of the evaluation:
1. Internal Consistency Checks (The “Accounting”)
- Sample Size: The group totals (463 EEG + 724 Standard) correctly sum to the total cohort of 1,187.
- Percentages in Table 1: All calculated percentages for the subgroups (e.g., Age ≥ 75, Female sex, ASA III–IV) match the raw counts perfectly (e.g., 191/463 = 41.25%, rounded to 41.3%).
- Outcome Counts: The text states delirium occurred in 9.7% of the EEG group (45 patients) and 14.6% of the Standard group (106 patients). These figures align with the adjusted odds ratio calculation.
2. Statistical Consistency (The “Math”)
- Normal Distribution Check: The relationship between “Mean Age” and “Age ≥ 75” is mathematically consistent.
- EEG Group: Mean 73.6 (SD 6.2). The cutoff of 75 is +0.22 standard deviations above the mean. In a standard normal distribution, the area above +0.22 SD is approx 41.1%. The observed value is 41.3%.
- Standard Group: Mean 74.2 (SD 6.5). The cutoff of 75 is +0.12 SD. The area above +0.12 SD is approx 45.1%. The observed value is 46.1%.
- Conclusion: This suggests the age data follows a standard distribution and is highly reliable.
- Regression Statistics (Table 2):
- Confidence Intervals: The 95% Confidence Intervals (CIs) are properly symmetric around the Odds Ratio (OR) when converted to the logarithmic scale. (e.g., for EEG: ln(0.65) approx -0.43; the CI bounds 0.47 and 0.90 are equidistant from this center in log-space).
- P-values: The reported p-values correspond correctly to the Z-scores derived from the CIs. For example, the EEG p-value of 0.009 perfectly matches the Z-score derived from the CI width.
3. Epidemiological Logic (The “Story”)
- Confounding Direction: The “Standard Care” group was sicker at baseline (higher rates of Frailty, Emergency Surgery, and ASA III-IV).
- Because the “Control” group was sicker, the Crude OR (unadjusted comparison) would overestimate the benefit of EEG (making EEG look better simply because the patients were healthier).
- Crude OR Calculation: (9.7% / 14.6%) ≈ 0.66 (Risk Ratio) → approx OR 0.63.
- Adjusted OR: 0.65.
- Result: The adjustment moved the OR closer to 1 (from 0.63 to 0.65). This is exactly what should happen when you adjust for the fact that the control group was sicker. The model successfully “penalized” the EEG group for having healthier patients, but a significant benefit (0.65) remained.
4. Cleared “Red Herrings”
You might have noticed that Intraoperative Opioid Dose was not statistically significant in Table 1 (p=0.06) but was significant in Table 2 (p=0.02).
- Verdict: This is not an anomaly. It is common for a variable to become significant in a multivariable model (Table 2) after other sources of variance (like Age and Frailty) are controlled for. It indicates that opioid dose is indeed an independent predictor of delirium, even if the average dose didn’t differ wildly between the two groups.
Conclusion
The report is statistically robust. The “Table 2 Trap” – where regression results often contradict baseline table data in fake or flawed papers – is not present here. The data tells a coherent clinical story backed by accurate mathematics.”
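Each of the model's numeric checks is straightforward to reproduce. The minimal Python sketch below reruns the arithmetic: the expected tail proportions for age ≥75, the log-scale symmetry of the reported confidence interval, and the p-value implied by the CI width (a Wald approximation). All three pass, which is exactly the point: internal consistency is cheap to manufacture and demonstrates coherence, not truth.

```python
# Reproducing the consistency checks applied to the fabricated report.
# Passing them shows internal coherence only; a fabricated dataset can
# satisfy all of them by construction.
import math
from scipy.stats import norm

# 1. Normal-tail check: expected share of patients aged >= 75.
for mean, sd, observed in [(73.6, 6.2, 0.413), (74.2, 6.5, 0.461)]:
    expected = 1 - norm.cdf((75 - mean) / sd)
    print(f"expected {expected:.1%} vs observed {observed:.1%}")

# 2. Log-scale symmetry of the reported 95% CI around the OR.
or_, lo, hi = 0.65, 0.47, 0.90
print(math.log(or_), (math.log(lo) + math.log(hi)) / 2)  # ~-0.431 vs ~-0.430

# 3. P-value implied by the CI width (Wald approximation).
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
z = math.log(or_) / se
p = 2 * (1 - norm.cdf(abs(z)))
print(f"implied p = {p:.3f}")  # ~0.009, matching the reported value
```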