ARXIV:2604.27720 · MEDICAL AI · SUBMITTED 01 MAY · 15:05 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Xupeng Chen · Binbin Shi · Chenqian Le · Qifu Yin · Lang Lin · Haowei Ni · +2 at arXiv

Auditing frontier vision-language models for trustworthiness in medical VQA, identifying grounding failures and domain adaptation needs.

Blocked on Code › Score 3.0 · Evidence unverified

Opportunity summary

Pain: Auditing frontier vision-language models for trustworthiness in medical VQA, identifying grounding failures and domain adaptation needs.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified


PROBLEM

Auditing frontier vision-language models for trustworthiness in medical VQA, identifying grounding failures and domain adaptation needs. We audit five recent frontier and grounding-aware VLMs (Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL) on Medical VQA along two…

METHOD

Full abstract

Deploying vision-language models (VLMs) in clinical settings demands auditable behavior under realistic failure conditions, yet the failure landscape of frontier VLMs on specialized medical inputs is poorly characterized. We audit five recent frontier and grounding-aware VLMs (Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL) on Medical VQA along two trust-relevant axes. Perception: all models localize anatomical and pathological targets poorly (the best model reaches only 0.23 mean IoU and 19.1% Acc@0.5) and exhibit clinically dangerous laterality confusion. Pipeline integration: a self-grounding pipeline, where the same model localizes then answers, degrades VQA accuracy for every model, driven by both inaccurate localization and format-compliance failures under the two-step prompt (parse failure rises to 70% to 99% for Gemini and GPT-5 on VQA-RAD). Replacing predicted boxes with ground-truth annotations recovers and improves VQA accuracy, consistent with the failure residing in the perception module rather than in the decomposition itself. These observational findings identify grounding quality as a primary trustworthiness bottleneck in our SLAKE bounding-box setting. As a complementary fine-tuning follow-up, supervised fine-tuning of Qwen 2.5 VL on combined Med-VQA training data attains the highest reported SLAKE open-ended recall (85.5%) among comparable methods, suggesting that the VQA-level gap is tractable with domain adaptation; whether this also closes the perception/trustworthiness bottleneck is left to future work.
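The two localization figures quoted in the abstract, mean IoU and Acc@0.5, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes boxes are given as (x_min, y_min, x_max, y_max) in pixel coordinates, that each question has exactly one predicted and one ground-truth box, and that Acc@0.5 counts a prediction as correct when its IoU is at least 0.5. The names `iou` and `grounding_metrics` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max). The paper's format is not stated here.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mean IoU, accuracy at the given IoU threshold)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        raise ValueError("at least one box pair is required")

    ious = [iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    # Assumption: a hit is IoU >= threshold; some protocols use a strict '>'.
    acc = sum(1 for v in ious if v >= threshold) / len(ious)
    return mean_iou, acc
```

Under these assumptions, a reported 0.23 mean IoU with 19.1% Acc@0.5 would mean that fewer than one in five predicted boxes overlaps its target by at least half. How unparseable model outputs are scored (for example, as IoU 0) is a separate protocol choice that this sketch does not cover.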

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. Replacing predicted boxes with ground-truth annotations recovers and improves VQA accuracy, consistent with the failure residing in the perception module rather than in the…

WHY NOW

Medical AI moved forward this cycle; last verified May 2026. Public score 3.0/10.




Competitive landscape


Segment: Medical AI

Adoption evidence: No public code link in the paper record yet

Commercial read: 3.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Paper record: arXiv:2604.27720 (https://arxiv.org/abs/2604.27720) · Published 2026-04-30 · Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Qifu Yin, Lang Lin, Haowei Ni, Ran Gong, Panfeng Li

