ARXIV:2603.25196 · MEDICAL AI · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Andong Tan · Shuyu Dai · Jinglu Wang · Fengtao Zhou · Yan Lu · Xi Wang · +4 at arXiv

A benchmark to evaluate and improve LLM adherence to clinical practice guidelines, crucial for safe healthcare deployment.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A benchmark to evaluate and improve LLM adherence to clinical practice guidelines, crucial for safe healthcare deployment.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

A benchmark to evaluate and improve LLM adherence to clinical practice guidelines, crucial for safe healthcare deployment. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent…

METHOD

Full abstract

Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations, published in the last decade and spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with the corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing a gap between knowing the guideline contents and knowing where they come from. Adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark to systematically reveal which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.
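The three headline metrics in the abstract (detection rate, title-reference rate, adherence rate) are per-model percentages over graded conversations. The sketch below shows one way such rates could be tallied. It is not the authors' code: the `Judgment` schema, its field names, and the `rates_by_model` helper are hypothetical, and the toy rows are made-up illustrations, not benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One graded conversation. Hypothetical schema, not taken from the paper."""
    model: str
    detected: bool     # the model surfaced the recommendation's content
    title_cited: bool  # the model named the source guideline title
    adhered: bool      # the model's advice followed the recommendation


METRICS = ("detection", "title", "adherence")


def rates_by_model(judgments):
    """Return {model: {metric: percentage}} rounded to one decimal place."""
    counts = defaultdict(lambda: {"n": 0, "detection": 0, "title": 0, "adherence": 0})
    for j in judgments:
        c = counts[j.model]
        c["n"] += 1
        c["detection"] += j.detected   # bools add as 0 or 1
        c["title"] += j.title_cited
        c["adherence"] += j.adhered
    return {
        model: {m: round(100.0 * c[m] / c["n"], 1) for m in METRICS}
        for model, c in counts.items()
    }


if __name__ == "__main__":
    # Made-up toy rows for illustration only.
    toy = [
        Judgment("model-a", True, False, True),
        Judgment("model-a", True, False, False),
        Judgment("model-a", True, True, False),
        Judgment("model-a", False, False, False),
    ]
    print(rates_by_model(toy))
    # {'model-a': {'detection': 75.0, 'title': 25.0, 'adherence': 25.0}}
```

Keeping the three flags separate per conversation lets the same table answer both gaps the abstract highlights: content detected without the title being cited, and content detected without being followed.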

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety critical, addressing these gaps is crucial for…

WHY NOW

Medical AI moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score: 7.0

Pain: A benchmark to evaluate and improve LLM adherence to clinical practice guidelines, crucial for safe healthcare deployment.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: no shell-level blocker reported

Analysis summary

A benchmark to evaluate and improve LLM adherence to clinical practice guidelines, crucial for safe healthcare deployment.

Competitive landscape

A benchmark to evaluate and improve LLM adherence to clinical practice guidelines, crucial for safe healthcare deployment.

Segment

Medical AI

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Paper record

arXiv: 2603.25196 (https://arxiv.org/abs/2603.25196) · Published: 2026-03-26T09:00:55.000Z · Free to access

Authors: Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen

Page: https://sciencetostartup.com/paper/a-decade-scale-benchmark-evaluating-llms-clinical-practice-guidelines-detection-and-adherence-in-multi-turn-conversation
