ARXIV:2604.00528 · AGENTS · SUBMITTED 02 APR · 20:57 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

Haibo Wang · Zihao Lin · Zhiyang Xu · Lifu Huang · arXiv

An agentic framework using Vision Language Models to perform zero-shot 3D visual grounding by dynamically reconstructing targets from RGB-D streams.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain: An agentic framework using Vision Language Models to perform zero-shot 3D visual grounding by dynamically reconstructing targets from RGB-D streams.

Evidence: 31 refs | 3 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

An agentic framework using Vision Language Models to perform zero-shot 3D visual grounding by dynamically reconstructing targets from RGB-D streams. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer…

METHOD

Full abstract

3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
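The abstract describes mapping 2D visual cues to 3D coordinates via camera parameters, and propagating the target's location to other frames with multi-view geometry. A minimal sketch of those two geometric steps follows, assuming a pinhole camera model, metric depth, and known 4x4 camera poses. The function names, the frame dictionary layout, and the axis-aligned box output are hypothetical choices for illustration; this is not the authors' implementation.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Matrix4 = Sequence[Sequence[float]]  # 4x4 rigid transform, row-major (assumed layout)
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole model


def transform(pose: Matrix4, p: Point3) -> Point3:
    """Apply the rotation and translation of a 4x4 pose to a 3D point."""
    x, y, z = p
    return tuple(
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )


def backproject(u: float, v: float, depth: float,
                intrinsics: Intrinsics, cam_to_world: Matrix4) -> Point3:
    """Lift pixel (u, v) with its depth value into world coordinates."""
    fx, fy, cx, cy = intrinsics
    # Point in the camera frame under the pinhole model.
    cam_point = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    return transform(cam_to_world, cam_point)


def project(point: Point3, intrinsics: Intrinsics,
            world_to_cam: Matrix4) -> Optional[Tuple[float, float]]:
    """Project a world point into a frame; None if it lies behind the camera."""
    fx, fy, cx, cy = intrinsics
    x, y, z = transform(world_to_cam, point)
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def build_box(frames: List[Dict]) -> Tuple[Point3, Point3]:
    """Aggregate target pixels from several views into one axis-aligned 3D box."""
    points: List[Point3] = []
    for frame in frames:
        for u, v, depth in frame["pixels"]:
            if depth <= 0:  # skip missing or invalid depth readings
                continue
            points.append(
                backproject(u, v, depth, frame["intrinsics"], frame["cam_to_world"])
            )
    if not points:
        raise ValueError("no valid target pixels to reconstruct from")
    box_min = tuple(min(p[i] for p in points) for i in range(3))
    box_max = tuple(max(p[i] for p in points) for i in range(3))
    return box_min, box_max


# Toy data: two views of the same target, the second camera shifted 1 m along x.
IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
SHIFTED = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
K = (500.0, 500.0, 320.0, 240.0)

frames = [
    {"intrinsics": K, "cam_to_world": IDENTITY,
     "pixels": [(320, 240, 2.0), (420, 240, 2.0), (320, 240, 0.0)]},
    {"intrinsics": K, "cam_to_world": SHIFTED,
     "pixels": [(320, 240, 1.5)]},
]

box_min, box_max = build_box(frames)
print("box min:", box_min, "box max:", box_max)
```

In this sketch the per-frame pixel lists stand in for whatever 2D masks a tracking tool would return, and `project` shows how an already-anchored 3D point could be located in a frame where the target was not tracked semantically. A real pipeline would also need depth filtering and outlier rejection, which are omitted here.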

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to…

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 8.0/10. Production flags indicate code availability.

Opportunity summary

Score: 8.0

Pain: An agentic framework using Vision Language Models to perform zero-shot 3D visual grounding by dynamically reconstructing targets from RGB-D streams.

Evidence: 31 refs | 3 sources | 50% coverage

Blocker: no shell-level blocker reported

Competitive landscape

An agentic framework using Vision Language Models to perform zero-shot 3D visual grounding by dynamically reconstructing targets from RGB-D streams.

Segment: Agents

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

arXiv: https://arxiv.org/abs/2604.00528 · Published: 2026-04-01T06:12:16.000Z · Free to access · Research domain: Agents · Commercial readiness: code
