ARXIV:2605.07817 · VISION-LANGUAGE MODELS · SUBMITTED 11 MAY · 20:45 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

Brown Ebouky · Gabriele Carrino · Niccolo Avogaro · Christoph Studer · Andrea Bartezzaghi · Mattia Rigotti · arXiv

GazeVLM is a multimodal architecture that internalizes metacognitive control over attention for active vision, enabling dynamic gaze token generation and surpassing state-of-the-art VLMs in high-resolution multimodal reasoning.

Blocked on Code · Score 6.0 · Evidence unverified

Opportunity summary

Pain GazeVLM is a multimodal architecture that internalizes metacognitive control over attention for active vision, enabling dynamic gaze token generation and surpassing state-of-the-art VLMs in high-resolution multimodal reasoning.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

Full abstract

Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{<LOOK>}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
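The suppression mechanism described in the abstract can be illustrated with a toy sketch. The abstract does not give the form of the bias, so everything below is an assumption made for illustration: the bias is modeled as an additive penalty on pre-softmax attention scores, and the names `gaze_bias`, `attend`, and `strength` are hypothetical, not taken from the paper. The sketch only shows the general idea of soft (continuous) suppression: tokens outside the focus region are dampened rather than masked out, and dropping the bias restores the global view.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaze_bias(num_visual_tokens, focus_indices, strength):
    """Hypothetical additive bias: 0 inside the focus region, -strength elsewhere.

    Assumption: the paper's actual bias may be shaped differently
    (e.g. smooth in space); this is the simplest soft variant.
    """
    focus = set(focus_indices)
    return [0.0 if i in focus else -strength for i in range(num_visual_tokens)]

def attend(scores, bias=None):
    """Attention weights over visual tokens, optionally with a gaze bias added."""
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    return softmax(scores)

# Toy example: four visual tokens with equal raw attention scores.
scores = [1.0, 1.0, 1.0, 1.0]

global_view = attend(scores)                              # no gaze active
focused = attend(scores, gaze_bias(4, [1, 2], strength=4.0))  # gaze on tokens 1 and 2
restored = attend(scores)                                 # bias lifted again

print(global_view)
print(focused)
```

With a finite `strength`, the out-of-focus tokens keep a small non-zero weight, which loosely mirrors the peripheral awareness the abstract describes; a hard mask would correspond to letting `strength` grow without bound.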

RESULT

ScienceToStartup currently rates this 6.0/10 on the public viability pass. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping…

WHY NOW

Vision-Language Models moved forward this cycle; last verified May 2026. Public score 6.0/10.

Competitive landscape

Segment: Vision-Language Models

Adoption evidence: No public code link in the paper record yet

Commercial read: 6.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified
