ARXIV:2605.30637 · MEDICAL AI · SUBMITTED 01 JUN · 20:20 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Yuzhang Xie · Keqi Han · Yunpeng Xiao · Hejie Cui · Guanchen Wu · Ziyang Zhang · +4 at arXiv

EHRBench: An automated, reliable, and scalable benchmark for evaluating LLMs in clinical decision-making using real-world electronic health records.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain: EHRBench: An automated, reliable, and scalable benchmark for evaluating LLMs in clinical decision-making using real-world electronic health records.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified


PROBLEM

EHRBench: An automated, reliable, and scalable benchmark for evaluating LLMs in clinical decision-making using real-world electronic health records. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge,…

METHOD

Full abstract

Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM models, especially LLM-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality. Moreover, the grounding of a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill these gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB (knowledge base) interaction pipeline. For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items. In parallel, we apply systematic KB-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core inference-required clinical decision tasks: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness. The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems.
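The abstract describes two steps that a small sketch can make concrete: KB-based verification of extracted relations, and deterministic instantiation of templates into QA items. The Python below is only an illustration under stated assumptions, not the authors' pipeline. The `Template` fields, the toy knowledge base, the question stems, and the helper names `kb_verified`, `instantiate`, and `build_items` are all hypothetical, and the LLM step that produces templates from EHR trajectories is omitted.

```python
from dataclasses import dataclass

# Hypothetical structured template: one relation extracted from an encounter.
# The field names are assumptions made for this sketch.
@dataclass(frozen=True)
class Template:
    task: str      # "diagnosis", "treatment", or "prognosis"
    context: str   # short patient summary used as the question context
    head: str      # source entity of the relation
    relation: str  # relation type to look up in the knowledge base
    answer: str    # target entity, used as the correct option

# Toy knowledge base of (head, relation, tail) triples, made up for illustration.
TOY_KB = {
    ("type 2 diabetes", "treated_by", "metformin"),
}

# Fixed question stems, so the same template always yields the same wording.
QUESTION_STEMS = {
    "diagnosis": "Which diagnosis best explains this presentation?",
    "treatment": "Which treatment is most appropriate?",
    "prognosis": "Which outcome is most likely?",
}

def kb_verified(template, kb):
    """Keep a template only if its relation appears in the knowledge base."""
    return (template.head, template.relation, template.answer) in kb

def instantiate(template, distractors):
    """Turn a template into a multiple-choice item without any randomness."""
    # Sorting the options makes the item reproducible across runs.
    options = sorted({template.answer, *distractors})
    return {
        "task": template.task,
        "question": f"{template.context} {QUESTION_STEMS[template.task]}",
        "options": options,
        "answer_index": options.index(template.answer),
    }

def build_items(templates, kb, distractor_pool):
    """Filter templates against the KB, then instantiate the survivors."""
    items = []
    for template in templates:
        if not kb_verified(template, kb):
            continue  # drop relations the KB cannot confirm
        distractors = [d for d in distractor_pool if d != template.answer][:3]
        items.append(instantiate(template, distractors))
    return items
```

In this sketch, a template whose relation is missing from the toy KB is silently dropped, which mirrors the filtering of hallucinated relations only in spirit. Sorting the options is one simple way to make instantiation deterministic; a real pipeline might instead use a seeded shuffle so the correct answer does not correlate with alphabetical order.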

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on…

WHY NOW

Medical AI moved forward this cycle; last verified June 2026. Public score 8.0/10. Production flags indicate code availability.


Opportunity summary

Score: 8.0

Pain: EHRBench: An automated, reliable, and scalable benchmark for evaluating LLMs in clinical decision-making using real-world electronic health records.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: no shell-level blocker reported

Analysis summary

EHRBench: An automated, reliable, and scalable benchmark for evaluating LLMs in clinical decision-making using real-world electronic health records.


Competitive landscape

EHRBench: An automated, reliable, and scalable benchmark for evaluating LLMs in clinical decision-making using real-world electronic health records.

Segment: Medical AI

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Published: 2026-05-28 · arXiv: https://arxiv.org/abs/2605.30637

Authors: Yuzhang Xie · Keqi Han · Yunpeng Xiao · Hejie Cui · Guanchen Wu · Ziyang Zhang · Kai Shu · Jiaying Lu · Xiao Hu · Carl Yang

