ARXIV:2605.14054 · VISION-LANGUAGE MODELS · SUBMITTED 15 MAY · 20:14 UTC · FRESHNESS FRESH

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

Haozhe Wang · Qixin Xu · Changpeng Wang · Taofeng Xue · Chong Peng · Wenhu Chen · Fangzhen Lin

A reinforcement learning framework that improves vision-language model synergy by rewarding perception fidelity, addressing the 'bad seeing or bad thinking' ambiguity.

Blocked on Code › Score 4.0 · Evidence unverified

Opportunity summary

Pain A reinforcement learning framework that improves vision-language model synergy by rewarding perception fidelity, addressing the 'bad seeing or bad thinking' ambiguity.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

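When a VLM fails, it is unclear whether the cause is flawed perception ("bad seeing") or flawed logic ("bad thinking"). The paper argues that this ambiguity in modality credit assignment is the root cause of the "seesaw effect" between perception and reasoning. Prior work has pursued perception-reasoning synergy via architectural designs or agentic workflows, but these are limited by static textual reasoning or carry a significant compute and engineering burden without proportional gains.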

METHOD

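The paper introduces a reinforcement learning framework that decomposes generation into interleaved perception and reasoning steps, which enables targeted supervision on perception. Perception Verification (PV) uses a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Structured Verbal Verification replaces high-variance LLM judging with structured algorithmic execution so that training scales across free-form VL tasks. Both are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error.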

Full abstract

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.
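
The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Step` type, the `BlindReasoner` signature, the 0/1 reward values, exact-match answer checking, and the rule that reasoning steps are rewarded by final-answer correctness are all assumptions made for illustration. The only parts taken from the abstract are the split into perception and reasoning steps, the "blindfolded" check that sees only the perception text, and the idea of routing reward by modality.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str  # "perception" or "reasoning" (hypothetical labels)
    text: str


# Hypothetical signature: (question, perception_text) -> answer, with no image access.
BlindReasoner = Callable[[str, str], str]


def perception_verified(steps: List[Step], question: str, gold: str,
                        blind_reasoner: BlindReasoner) -> bool:
    """Blindfolded proxy: can the question be answered from the perception text alone?"""
    described = " ".join(s.text for s in steps if s.kind == "perception")
    return blind_reasoner(question, described) == gold


def route_rewards(steps: List[Step], final_answer: str, question: str, gold: str,
                  blind_reasoner: BlindReasoner) -> List[float]:
    """Return one reward per step, routed by modality (assumed 0/1 scheme)."""
    seeing_ok = perception_verified(steps, question, gold, blind_reasoner)
    thinking_ok = final_answer == gold
    perception_reward = 1.0 if seeing_ok else 0.0
    reasoning_reward = 1.0 if thinking_ok else 0.0
    return [perception_reward if s.kind == "perception" else reasoning_reward
            for s in steps]


def toy_blind_reasoner(question: str, description: str) -> str:
    """Stand-in for a text-only model; a real setup would call an LLM here."""
    return "3" if "three apples" in description else "unknown"


steps = [
    Step("perception", "The image shows three apples on a table."),
    Step("reasoning", "Counting the apples gives four."),
]
rewards = route_rewards(steps, "4", "How many apples are there?", "3",
                        toy_blind_reasoner)
print(rewards)  # [1.0, 0.0]: good seeing, bad thinking
```

In this toy case the perception step is rewarded even though the final answer is wrong, because the description alone was enough for the text-only proxy to answer correctly. That separation is what the abstract means by rewarding perception independently of reasoning outcomes.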

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. The abstract states that the MoCA mechanism enables a single VLM to achieve simultaneous performance gains across a wide task spectrum; it gives no quantitative figures.

WHY NOW

Vision-Language Models moved forward this cycle; last verified May 2026. Public score 4.0/10.

Competitive landscape

Segment

Vision-Language Models

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Paper record: arXiv 2605.14054 (https://arxiv.org/abs/2605.14054) · Published 2026-05-13T19:23:53.000Z · Authors: Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin · Free to access · Page: https://sciencetostartup.com/paper/bad-seeing-or-bad-thinking-rewarding-perception-for-vision-language-reasoning
