ARXIV:2603.11413 · CONSUMER HEALTH AI · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Partial · PaperPack: 3 of 4 citation fields filled
Missing · Missing fields: authors
Partial · Proof: unverified proof status

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

arXiv

This paper critiques the evaluation methods of consumer health AI triage systems, emphasizing the need for realistic testing conditions.

Blocked on Code · Score 3.0 · Evidence unverified

Opportunity summary

Pain This paper critiques the evaluation methods of consumer health AI triage systems, emphasizing the need for realistic testing conditions.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

This paper critiques the evaluation methods of consumer health AI triage systems, emphasizing the need for realistic testing conditions. Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI…

METHOD

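Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. This paper re-tests that claim with five frontier LLMs on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the original authors' released prompts.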

Full abstract

Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol (forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions) that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points (p = 0.015). Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions. Asthma triage improved from 48% to 80%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0–24% with forced choice but 100% with free text (all p < 10⁻⁸), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. The headline under-triage rate is highly contingent on evaluation format and should not be interpreted as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.
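The format effect described in the abstract can be illustrated with a minimal scoring sketch. Everything here is an assumption for illustration: the A-to-D urgency ordering, the emergency phrase list, the fallback level for non-emergency free text, and the three toy replies are all hypothetical, since this record gives neither the paper's option wording nor its scoring rubric. The sketch only shows how the same emergency scenarios can register under-triage when a letter is forced, yet none when the reply is read in the model's own words.

```python
# Hypothetical urgency scale, lowest to highest. The paper's actual A/B/C/D
# option wording is not given in this record.
LEVELS = {"A": 0, "B": 1, "C": 2, "D": 3}
EMERGENCY = LEVELS["D"]

# Hypothetical phrases treated as an emergency recommendation in free text.
EMERGENCY_PHRASES = ("call 911", "emergency department", "emergency room")


def level_from_letter(letter: str) -> int:
    """Map a forced-choice letter to its (assumed) urgency level."""
    return LEVELS[letter.strip().upper()]


def level_from_free_text(reply: str) -> int:
    """Crude keyword reading of a free-text reply (assumed rubric)."""
    text = reply.lower()
    if any(phrase in text for phrase in EMERGENCY_PHRASES):
        return EMERGENCY
    return LEVELS["B"]  # placeholder for "anything short of emergency care"


def under_triage_rate(gold: list[int], predicted: list[int]) -> float:
    """Share of trials where the predicted urgency is below the reference."""
    pairs = list(zip(gold, predicted))
    return sum(pred < ref for ref, pred in pairs) / len(pairs)


# Toy trials, invented for illustration only (not data from the paper).
gold = [EMERGENCY, EMERGENCY, EMERGENCY]
forced_replies = ["C", "C", "D"]
free_replies = [
    "This sounds serious. Please call 911 now.",
    "You should go to the emergency department right away.",
    "Go to the nearest emergency room immediately.",
]

forced_rate = under_triage_rate(gold, [level_from_letter(r) for r in forced_replies])
free_rate = under_triage_rate(gold, [level_from_free_text(r) for r in free_replies])

print(f"forced-choice under-triage: {forced_rate:.2f}")
print(f"free-text under-triage:     {free_rate:.2f}")
```

A keyword match is a deliberately simple stand-in; a real evaluation would more plausibly use clinician raters or a validated rubric to read free-text replies.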

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results.

WHY NOW

Consumer Health AI moved forward this cycle; last verified April 2026. Public score 3.0/10.

Opportunity summary

Score 3.0

Pain This paper critiques the evaluation methods of consumer health AI triage systems, emphasizing the need for realistic testing conditions.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Competitive landscape

This paper critiques the evaluation methods of consumer health AI triage systems, emphasizing the need for realistic testing conditions.

Segment

Consumer Health AI

Adoption evidence

No public code link in the paper record yet

Commercial read

3.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Published 2026-03-12T00:58:22Z · arXiv: https://arxiv.org/abs/2603.11413 · Page: https://sciencetostartup.com/paper/evaluation-format-not-model-capability-drives-triage-failure-in-the-assessment-of-consumer-health-ai
