General Purpose AI Models Beat Clinical AI Tools in Medical Study

By Benjamin Caplan, MD

General-Purpose Large Language Models Outperform Specialized Clinical AI Tools on Medical Benchmarks

A three-stage benchmark study published in Nature Medicine finds that GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 surpass OpenEvidence and UpToDate Expert AI on every evaluation used, including clinician-rated real-world queries.

Study at a Glance

Publication	Nature Medicine (2026); DOI: 10.1038/s41591-026-04431-5
Study Type	Comparative benchmark evaluation of AI systems
AI Systems Tested	GPT-5.2, Gemini 3.1 Pro Preview, Claude Opus 4.6 (general-purpose); OpenEvidence, UpToDate Expert AI (clinical-specific); Google Search AI Overview (reference comparator)
Benchmarks Used	MedQA (500 questions), HealthBench (500 items), Real Clinical Queries (100 queries, 1,800 clinician annotations)
Human Evaluators	12 practicing U.S. clinicians, blinded to model identity
Primary Finding	General-purpose LLMs outperformed specialized clinical AI tools across all three evaluation stages
Key Secondary Finding	Specialized clinical tools performed at approximately the same level as Google Search AI Overview on real clinical queries
Clinical Relevance	High: directly pertains to AI tools currently entering clinical workflows

Background

The deployment of artificial intelligence tools in clinical settings has accelerated substantially over the past several years. Health systems, hospital networks, and private practices are now actively adopting AI products marketed under the premise that domain-specific design confers clinical superiority over general-purpose alternatives. Two categories of tools have emerged in parallel: frontier general-purpose large language models (LLMs) accessible to any user with internet access, and proprietary clinical AI platforms built on LLM architectures and enhanced with curated medical databases, retrieval-augmented generation (RAG) pipelines, or domain-specific fine-tuning.

The theoretical rationale for clinical AI specialization is coherent. Medical language is precise; clinical reasoning involves weighing probabilistic differentials against patient-specific context; and errors in this domain carry consequences that errors in general text generation do not. The assumption embedded in most clinical AI procurement decisions is that tools engineered specifically for medicine will outperform general-purpose models on medically relevant tasks.

That assumption had not been subjected to rigorous independent head-to-head testing at the time this study was conducted. Vendors of clinical AI tools are not required to publish comparative performance data before entering clinical markets, and health systems typically lack the infrastructure to conduct their own pre-deployment evaluations. The result is a widespread pattern of adoption based on marketing claims rather than independently verified benchmarks.

This study was designed to fill that evidentiary gap directly, comparing two leading proprietary clinical AI platforms against three frontier general-purpose LLMs across a structured progression of medical knowledge and clinical reasoning tasks, culminating in a novel benchmark built from real queries submitted by practicing physicians.

Objectives

The study pursued three linked objectives. First, it sought to determine whether specialized clinical AI tools perform better than general-purpose LLMs on standardized medical knowledge assessments. Second, it assessed whether clinical AI tools demonstrate superior alignment with clinician reasoning and values as measured by a dedicated clinician-alignment framework. Third, and most clinically consequential, it evaluated comparative performance on a novel benchmark constructed from actual physician-generated queries in a live clinical environment, using blinded expert clinician review as the primary scoring mechanism.

Methods

Study Design

This was a prospective comparative benchmark evaluation. All five AI systems were assessed under identical conditions across three sequential stages, each targeting a distinct dimension of medical AI performance. The design is not a randomized controlled trial and does not involve patient-level interventions or outcomes. Its findings concern AI system behavior, not clinical care delivery.

AI Systems Evaluated

The three general-purpose frontier LLMs tested were OpenAI GPT-5.2, Google Gemini 3.1 Pro Preview, and Anthropic Claude Opus 4.6. These systems were deployed without domain-specific modification or clinical fine-tuning, meaning any performance advantage they demonstrated reflects the capability of the base models as a general user would access them. The two clinical AI tools were OpenEvidence and UpToDate Expert AI, both of which are built on LLM architectures and marketed to health systems on the basis of domain-specific training and, in at least some configurations, RAG pipelines connecting the model to curated medical literature databases. Google Search AI Overview served as a floor-level reference comparator, representing AI-mediated output available through a standard internet search rather than a purpose-built clinical or general-purpose AI system.

Benchmark Stages

Stage 1: MedQA (500 questions). MedQA is a well-validated dataset of United States Medical Licensing Examination (USMLE)-style questions assessing factual medical knowledge, structured recall, and stepwise clinical reasoning. Accuracy on these items constitutes a widely used benchmark in medical AI research.

Stage 2: HealthBench (500 items). HealthBench is a newer evaluation framework designed to measure alignment between AI responses and clinician reasoning, values, and communication standards. It probes dimensions of AI performance that USMLE-style accuracy cannot capture, including whether responses are appropriate, safe, and practically useful from a practicing clinician’s perspective.

Stage 3: Real Clinical Queries (RCQ) Benchmark (100 queries, 1,800 annotations). The RCQ benchmark was constructed specifically for this study from 100 de-identified queries that practicing physicians had previously submitted to a general-purpose LLM during actual clinical work. These were not hypothetical or textbook questions; they represent the organic information needs of clinicians in real practice contexts. All five AI systems generated responses to each query. Twelve practicing U.S. clinicians then reviewed these responses in a randomized, blinded format, scoring outputs without knowledge of which model produced each response. The 1,800 total annotations average 18 clinician ratings per query across all models, providing a reasonably dense signal for comparative evaluation.
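As a rough illustration of what a randomized, blinded review format involves, the Python sketch below shuffles the responses for one query and swaps model names for neutral labels, keeping the unblinding key separate from what reviewers see. It is a minimal sketch under stated assumptions, not the study's actual procedure: the function name `blind_responses`, the `BlindedItem` type, the "Response A/B/C" labels, and the placeholder model names are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlindedItem:
    query_id: int
    blind_label: str  # what the reviewer sees, e.g. "Response A"
    text: str


def blind_responses(query_id, responses, rng):
    """Shuffle one query's responses and hide which model wrote each.

    `responses` maps a model name to its response text. Returns the blinded
    items (safe to show reviewers) and a separate unblinding key that maps
    (query_id, blind_label) back to the model name.
    """
    models = list(responses)
    rng.shuffle(models)  # random presentation order for this query
    items, key = [], {}
    for position, model in enumerate(models):
        label = f"Response {chr(ord('A') + position)}"
        items.append(BlindedItem(query_id, label, responses[model]))
        key[(query_id, label)] = model
    return items, key


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    # Placeholder model names and texts, not outputs from the study.
    demo = {"model_x": "text one", "model_y": "text two", "model_z": "text three"}
    items, key = blind_responses(1, demo, rng)
    for item in items:
        print(item.blind_label, "->", item.text)
```

Reshuffling per query means a reviewer cannot learn that a given label always corresponds to the same system, which is one common way to keep blinding intact across many items.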

Clinician Evaluator Panel

Twelve practicing U.S. clinicians served as expert evaluators in Stage 3. Their role was to assess AI-generated responses to clinical queries under blinded conditions. Specific details regarding their specialties, years of experience, practice settings, and geographic distribution are not reported in the available data, which limits assessment of evaluator generalizability.

Outcomes

Primary outcomes were comparative performance rankings across all three benchmark stages. Secondary observations addressed the transparency gap created by undisclosed proprietary architectures, the failure of domain-specific modification to confer performance advantage, and the relative performance of clinical AI tools against the Google Search AI Overview reference comparator.

Primary Results

The primary finding is consistent and directionally unambiguous: frontier general-purpose LLMs outperformed specialized clinical AI tools across every benchmark stage evaluated.

On MedQA, GPT-5.2, Gemini 3.1 Pro Preview, and Claude Opus 4.6 each achieved higher accuracy scores than either OpenEvidence or UpToDate Expert AI. The general-purpose models performed better on structured medical knowledge and USMLE-style clinical reasoning despite having no domain-specific modifications applied for this evaluation.

On HealthBench, the pattern held. General-purpose models again outperformed both clinical AI tools on the dimension of clinician-value alignment, the very capability that specialized clinical AI platforms are most explicitly engineered to optimize.

On the RCQ benchmark, practicing physicians reviewing responses under blinded conditions rated the outputs of GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 as superior to those from OpenEvidence and UpToDate Expert AI. This finding has the greatest face validity for clinical translation because the inputs were drawn from real physician queries and the scoring was performed by clinicians with direct experience of those information needs.

A calibration note is warranted: the data available for this synthesis do not include specific numerical scores, percentage-point differences, or confidence intervals. Readers seeking precise quantitative comparisons should consult the full manuscript. The directional conclusions reported here reflect what the authors clearly establish; the magnitude of differences cannot be specified from the available summary data.

Secondary Results

Clinical AI Tools Performed at the Level of Google Search AI Overview

On the RCQ benchmark, the two specialized clinical AI platforms performed at approximately the same level as Google Search AI Overview. Google Search AI Overview is not a clinical tool. It is not trained on curated medical literature for professional use, not subject to clinical safety review specific to its use as a decision-support instrument, and not marketed to health systems as a clinical AI product. The approximate equivalence of purpose-built, commercially deployed clinical AI tools to a general consumer search feature is a consequential finding for any clinician or administrator evaluating AI procurement decisions.

Domain-Specific Modification Did Not Confer a Performance Advantage

The theoretical foundation of clinical AI specialization is that RAG pipelines, curated medical databases, and domain-specific fine-tuning yield clinically superior outputs. This study does not support that premise. General-purpose models operating without these enhancements outperformed models that deploy them. This does not establish that domain-specific modification is never beneficial, but it directly challenges the assumption that it reliably produces performance gains sufficient to justify adoption over freely available alternatives.

The Transparency Gap Has Measurable Consequences

OpenEvidence and UpToDate Expert AI do not publicly disclose their underlying architectures, base models, or training pipelines. This opacity prevented the researchers from determining why these tools underperformed. It also means that clinicians and health systems deploying these tools are making consequential procurement decisions without the technical information necessary for independent risk assessment. This study documents one consequence of that opacity: underperformance relative to alternatives available to anyone with a laptop.

Blinded Clinician Reviewers Detected Performance Differences

The fact that 12 clinicians, reviewing outputs without knowledge of model identity, consistently rated general-purpose LLM outputs higher than clinical AI tool outputs indicates that these differences are perceptible at the level of clinical judgment, not merely at the level of automated scoring metrics. This reinforces the practical relevance of the performance gap.

Adverse Events

This study did not involve human participants receiving clinical interventions, so adverse events in the traditional pharmacological or procedural sense do not apply. The study did not formally adjudicate AI-generated errors, hallucinations, or safety failures in individual responses, though the HealthBench framework incorporates clinician-alignment criteria that indirectly penalize outputs clinicians would consider inappropriate or unsafe. Detailed error taxonomies were not available in the data used for this synthesis.

Subgroup Analyses

No formal subgroup analyses stratified by query type, medical specialty, disease category, or clinician evaluator characteristics were described in the available data. The RCQ benchmark included 100 queries drawn from real-world clinical use, which likely span multiple clinical domains, but domain-specific performance breakdowns are not reported in the summary. This is a meaningful gap: performance differences between general-purpose and clinical AI tools may vary by specialty or query complexity in ways that aggregate rankings cannot reveal. Future iterations of this evaluation framework would benefit from pre-specified subgroup analyses across clinical domains.

Statistical Rigor

The study employed structured, blinded evaluation by multiple independent clinician reviewers across the RCQ benchmark, with 1,800 total annotations across 100 queries and five models. Blinding of evaluators to model identity is an appropriate methodological choice that reduces, though does not eliminate, scoring bias. The use of three sequential benchmarks addressing different performance dimensions strengthens the overall evidentiary case by demonstrating consistency across evaluation approaches.

Limitations in statistical reporting are notable: specific accuracy scores, percentage-point differences between systems, inter-rater reliability statistics, and formal significance testing results could not be extracted from the summary data used for this synthesis. Without confidence intervals and p-values, it is impossible to assess the precision of the reported findings or the probability that observed differences reflect true performance gaps rather than measurement variability. The full published manuscript should be consulted for these details before drawing quantitative conclusions.
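To make concrete what a confidence interval adds to a directional ranking, the Python sketch below computes a percentile bootstrap interval for the difference in mean ratings between two systems. It is an illustration only, not the study's analysis: the function name `bootstrap_diff_ci` is hypothetical, the 1-to-5 rating scale is an assumption, and the ratings are invented numbers. It also treats ratings as independent, whereas a real analysis would need to account for clustering by query and by rater.

```python
import random
import statistics


def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Returns (observed_difference, lower_bound, upper_bound).
    Assumes ratings are independent, which ignores rater and query clustering.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(a) - statistics.fmean(b), lower, upper


if __name__ == "__main__":
    # Invented ratings on an assumed 1-5 scale; NOT data from the study.
    system_one = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
    system_two = [3, 4, 3, 2, 4, 3, 3, 4, 2, 3]
    diff, lower, upper = bootstrap_diff_ci(system_one, system_two)
    print(f"difference in means: {diff:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero supports a real difference in direction, while its width indicates how precisely the size of the gap is known; this is the information a ranking alone cannot convey.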

Careful Reader Takeaway

This study establishes a clear directional finding within a well-structured benchmark framework: general-purpose LLMs outperformed specialized clinical AI tools across every evaluation stage, including one constructed from real-world physician queries and scored by blinded clinician reviewers. The direction of that finding is consistent across all three stages, although its magnitude and statistical precision cannot be assessed from the summary data available here.

Several caveats apply. Benchmark performance does not equate to clinical outcome superiority; a model that scores higher on MedQA or HealthBench has not been shown to improve patient outcomes, reduce diagnostic errors, or enhance safety in actual care delivery. The 12-clinician evaluator panel may not be representative of the full diversity of U.S. clinical practice. The proprietary architectures of the clinical AI tools are undisclosed, which prevents mechanistic explanation of the performance gap and raises questions about whether the versions tested reflect current or updated deployments. Finally, the rapidly iterative nature of LLM development means that performance rankings in any benchmark evaluation are subject to change as models are updated.

The study’s most durable contribution may be methodological: the RCQ benchmark, built from real physician queries and scored by blinded clinicians, offers a more ecologically valid evaluation framework than USMLE-style accuracy alone, and its broader adoption would strengthen the evidence base for clinical AI procurement decisions.

Dr. Caplan’s Commentary

What this study documents is something that many clinicians who use AI tools daily have suspected but lacked the structured evidence to articulate: the “clinical” label on an AI product does not confer clinical superiority. It confers marketing positioning. The finding that OpenEvidence and UpToDate Expert AI, two products actively entering clinical workflows at scale, performed at approximately the same level as Google Search AI Overview on real physician queries is not a minor footnote. It is the central provocation of this work, and it demands a serious response from the health systems, professional societies, and regulatory bodies that have largely remained passive as these tools proliferate.

The transparency problem compounds the performance problem. When a clinical AI tool’s base model, training pipeline, and retrieval architecture are proprietary and undisclosed, independent researchers cannot explain why the tool underperforms, and clinicians cannot make informed judgments about when to trust it. We accept transparency as a baseline requirement for pharmaceuticals, devices, and clinical protocols. There is no principled reason to exempt AI tools from the same standard, particularly when those tools are positioned as clinical decision support in high-stakes care environments.

The RAG premise deserves direct scrutiny here. The core commercial argument for clinical AI specialization is that connecting a language model to curated, peer-reviewed medical databases produces outputs that are more accurate, more current, and more clinically aligned than a general-purpose model operating from its training data. This study found the opposite. That does not settle the question permanently, because RAG implementation quality varies widely and this evaluation tested specific products at a specific point in time. But it does mean that the argument cannot be accepted on faith, and any health system deploying a RAG-enhanced clinical AI tool should demand independent benchmark data before assuming the enhancement delivers what it promises.

From a patient safety perspective, the implications are not hypothetical. If a clinician is using a specialized clinical AI tool to support a differential diagnosis, review a drug interaction, or interpret a guideline recommendation, and that tool is performing at the level of a general internet search feature, the margin for consequential error is wider than the tool’s marketing would suggest. Patients have no mechanism to know which AI tool, if any, influenced a clinical decision in their care. That asymmetry of information is a structural problem, and studies like this one are the beginning of the accountability infrastructure that addresses it.

The methodological contribution of the RCQ benchmark is worth emphasizing separately. Evaluating AI on USMLE-style questions tells us something useful about factual recall and structured reasoning. Evaluating AI on HealthBench tells us something useful about value alignment. But neither captures the actual information environment of clinical practice, where questions are urgent, partial, and shaped by patient-specific context. A benchmark built from real physician queries, scored by blinded clinicians, is closer to the ground truth of clinical utility. Broader adoption of this framework, or its successors, would meaningfully raise the evidentiary floor for clinical AI deployment decisions.

Prior Research Context

Prior evaluations of LLM performance in medicine have largely focused on USMLE-style accuracy, with GPT-4 and its successors demonstrating passing-level performance on the USMLE Step 1 through Step 3 examinations in multiple independent evaluations published between 2023 and 2025. These studies established that frontier general-purpose LLMs are capable of encoding substantial medical knowledge but did not directly compare them to purpose-built clinical AI tools under head-to-head conditions.

HealthBench, introduced as an evaluation framework more recently, represented a methodological advance by incorporating clinician-derived criteria for response quality rather than relying solely on answer-key accuracy. Its inclusion in this study reflects the field’s recognition that factual correctness is a necessary but insufficient criterion for clinical AI utility.

Comparative evaluations of clinical AI tools against general-purpose models have been limited, partly because vendors of proprietary clinical AI platforms have had little incentive to sponsor or facilitate independent head-to-head comparisons. The few prior analyses that approached this question were typically narrower in scope, focused on single specialties, or relied on automated rather than clinician-scored outcomes. This study is among the first to apply a three-stage, multi-benchmark, blinded-clinician framework to a direct comparison between these two categories of AI tools at the frontier of general-purpose model capability.

Practical Implications

For clinicians, this study provides evidence-based justification for skepticism toward claims that a clinical AI tool’s specialized design makes it safer or more reliable than freely available alternatives. Clinicians who have access to frontier general-purpose LLMs and are being directed toward proprietary clinical AI platforms by their institutions should ask what independent performance data support that recommendation.

For health system administrators and clinical informatics leaders, the study argues directly against assuming that proprietary and purpose-built implies superior. Pre-deployment evaluation against relevant benchmarks, including clinician-rated real-world query performance, should be a standard component of clinical AI procurement. Vendor-provided performance claims are not a substitute for independent assessment.

For patients, the relevant implication is that the AI tools influencing clinical decisions in their care have not necessarily been subjected to rigorous independent testing before adoption. Asking providers whether AI tools are used in their care and what evidence supports those tools’ clinical safety and accuracy is a reasonable and appropriate question.

For policymakers and regulators, this study illustrates the evidentiary gap created by the absence of mandatory pre-market performance disclosure for clinical AI tools. The FDA’s current regulatory framework for AI-based Software as a Medical Device (SaMD) does not require the kind of comparative benchmark disclosure that would make findings like these available before adoption rather than after.

Limitations

Several limitations qualify the scope of these findings. The study is a benchmark evaluation, not a clinical outcomes study. Superior performance on MedQA, HealthBench, or the RCQ benchmark does not establish that general-purpose LLMs improve patient outcomes or reduce clinical errors in practice. That question requires a different study design.

The 12-clinician evaluator panel is small, and the absence of reported demographic, specialty, and practice-setting information prevents assessment of whether the panel is representative of U.S. clinical practice. Evaluator composition could influence quality ratings in ways that are not visible from the available data.

The proprietary architectures of OpenEvidence and UpToDate Expert AI are undisclosed. It is not possible to determine from this study whether the observed underperformance reflects fundamental limitations in the tools’ underlying models, suboptimal RAG implementation, outdated training data, or other factors. Nor is it possible to confirm that the versions tested in this evaluation match current production deployments, given the rapid iteration cycles of commercial AI products.

Specific numerical performance data, confidence intervals, and significance testing results were not available in the data used for this synthesis, limiting the ability to convey the magnitude and precision of reported differences. The full manuscript should be consulted for these figures.

Finally, this evaluation was conducted at a single point in time. The LLM landscape changes rapidly, and performance rankings among specific model versions are subject to reversal as updates are released. The methodological framework is more durable than any specific ranking it produces.

Future Directions

The most important methodological next step is clinical outcomes linkage: prospective studies that measure whether clinicians using general-purpose versus clinical-specific AI tools differ on downstream metrics such as diagnostic accuracy, time to correct diagnosis, medication error rates, or patient-reported outcomes. Benchmark performance is a proxy; patient outcomes are the target.

Expansion of the RCQ framework to larger, specialty-stratified query sets with demographically characterized evaluator panels would increase the generalizability and granularity of clinician-rated evaluations. Domain-specific performance breakdowns across primary care, emergency medicine, oncology, and other high-acuity settings are needed before drawing specialty-level conclusions.

Longitudinal benchmark tracking, repeating evaluations as models are updated, would provide the field with a dynamic picture of how the performance gap between general-purpose and clinical-specific tools evolves. Given the pace of model development, a single cross-sectional evaluation has a limited shelf life.

Regulatory and policy research should examine what pre-market performance disclosure requirements would be both feasible and sufficient to inform clinical procurement decisions. Analogues from pharmaceutical comparative effectiveness frameworks and medical device performance standards offer relevant models.

Bottom Line

In a three-stage benchmark evaluation comparing five AI systems on medical knowledge, clinician-value alignment, and real-world clinical query performance, frontier general-purpose LLMs (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) outperformed two purpose-built clinical AI platforms (OpenEvidence, UpToDate Expert AI) across every measure. The specialized clinical tools performed at approximately the level of Google Search AI Overview on the most clinically relevant benchmark. These findings challenge the core premise that domain-specific AI design reliably confers clinical performance advantages, and they argue for mandatory independent benchmarking before clinical AI tools enter patient care environments.

Frequently Asked Questions

Does this study mean that general-purpose AI tools are safe to use for clinical decisions?

No. This study measures relative benchmark performance between AI systems; it does not establish that any AI tool is safe or appropriate for unsupervised clinical decision-making. General-purpose LLMs outperforming specialized clinical tools on these benchmarks means they scored higher on specific evaluations, not that they should be used without clinical oversight. All AI-generated clinical content requires verification by a qualified clinician who can integrate patient-specific context.

Why would specialized clinical AI tools perform worse than general-purpose models?

The study cannot fully answer this because the proprietary architectures of the clinical AI tools are undisclosed. Possible explanations include weaker underlying base models, suboptimal RAG pipeline implementation, training data that is less current or comprehensive than the pretraining corpora of frontier general-purpose models, or optimization objectives that trade raw performance for other properties such as citation formatting. Without architectural transparency, causal explanation is not possible.

Will these rankings hold as AI models continue to update?

Not necessarily. All of the models evaluated in this study will continue to be updated, and relative performance can shift substantially between versions. The specific rankings reported here reflect a defined evaluation at a defined point in time. The methodological framework, especially the RCQ benchmark design, is more likely to remain valuable than any particular ranking it produces.

Should patients be concerned if their provider uses a clinical AI tool?

Patients have a legitimate interest in knowing whether AI tools are used in their care and what evidence supports those tools. This study provides evidence that the “clinical AI” designation does not guarantee superior performance, and that some specialized tools currently in clinical use may not have been independently benchmarked before adoption. Asking providers directly about which AI tools, if any, are used in clinical decision-making is a reasonable question.

What is retrieval-augmented generation (RAG) and why does it matter here?

RAG is a technique that allows a language model to query external databases in real time, retrieving relevant documents before generating a response. Clinical AI tools often use RAG to connect models to curated medical literature, on the premise that grounding responses in peer-reviewed sources improves accuracy and reduces hallucination. This study found that RAG-enhanced clinical tools did not outperform general-purpose models operating without this enhancement, which challenges but does not definitively refute the theoretical benefit of RAG in clinical contexts.
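The retrieve-then-generate pattern can be sketched in a few lines of Python. This is a toy illustration, not how any product named in this post works: retrieval here is simple word overlap rather than the embedding search production systems typically use, the functions `tokenize`, `retrieve`, and `build_prompt` are hypothetical names, the documents are placeholder strings, and no language model is actually called.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query.

    Returns up to k documents that share at least one word with the query,
    best match first (ties broken by original order).
    """
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), i) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[i] for score, i in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Assemble the text a model would receive: retrieved sources, then the question."""
    passages = retrieve(query, documents, k)
    context = "\n".join(f"[{n}] {p}" for n, p in enumerate(passages, start=1))
    return f"Use only the sources below.\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    # Placeholder corpus; a real system would index curated literature.
    corpus = [
        "placeholder passage on anticoagulant dosing",
        "placeholder passage on asthma inhalers",
        "unrelated text about billing codes",
    ]
    print(build_prompt("question about anticoagulant dosing", corpus, k=1))
```

The sketch also shows why implementation quality matters: the model only sees what the retrieval step returns, so a weak ranking function or a thin corpus limits the answer regardless of how capable the underlying model is.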

Citation

General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nature Medicine. 2026. DOI: 10.1038/s41591-026-04431-5

This post was prepared by the CED Clinic clinical content team under the direction of Dr. Benjamin Caplan. It reflects findings reported in the cited publication and does not constitute medical advice. Patients should consult a qualified clinician before making any changes to their care.