ARXIV:2603.24866 · PHYSICAL GENERATIVE REASONING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

Luyu Yang · Yutong Dai · An Yan · Viraj Prabhu · Ran Xu · Zeyuan Chen · arXiv

A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed…

METHOD

Full abstract

The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse
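The observe, act, and feedback cycle described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the DreamHouse API: the names `Member`, `Feedback`, `validate`, `run_episode`, and `scripted_policy` are hypothetical, and the two checks are invented stand-ins, since the abstract does not list the 10 tests of the actual validation framework. A hard-coded policy stands in for a VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Member:
    """A placed framing member (hypothetical toy representation)."""
    name: str
    kind: str                           # e.g. "plate" or "stud" (toy categories)
    supported_by: Optional[str] = None  # name of the member it rests on


@dataclass
class Feedback:
    """Structured result returned to the policy after each attempted action."""
    passed: bool
    failures: List[str] = field(default_factory=list)


def check_support(members: List[Member]) -> List[str]:
    # Toy check: every declared support must already exist in the build.
    names = {m.name for m in members}
    return [f"{m.name}: missing support {m.supported_by}"
            for m in members
            if m.supported_by is not None and m.supported_by not in names]


def check_unique_names(members: List[Member]) -> List[str]:
    # Toy check: no two members may share a name.
    seen, failures = set(), []
    for m in members:
        if m.name in seen:
            failures.append(f"{m.name}: duplicate member")
        seen.add(m.name)
    return failures


# Illustrative stand-ins only; NOT the paper's 10-test framework.
CHECKS = [check_support, check_unique_names]


def validate(members: List[Member]) -> Feedback:
    """Run every check deterministically and collect all failures."""
    failures = [msg for check in CHECKS for msg in check(members)]
    return Feedback(passed=not failures, failures=failures)


Policy = Callable[[List[Member], Optional[Feedback]], Optional[Member]]


def run_episode(policy: Policy, max_steps: int = 10
                ) -> Tuple[List[Member], List[Tuple[Member, Feedback]]]:
    """Observe state, take an action, validate, and feed the result back."""
    state: List[Member] = []
    history: List[Tuple[Member, Feedback]] = []
    feedback: Optional[Feedback] = None
    for _ in range(max_steps):
        action = policy(state, feedback)
        if action is None:          # the policy declares the build finished
            break
        candidate = state + [action]
        feedback = validate(candidate)
        history.append((action, feedback))
        if feedback.passed:         # rejected actions leave the state unchanged
            state = candidate
    return state, history


def scripted_policy(state: List[Member],
                    feedback: Optional[Feedback]) -> Optional[Member]:
    """Stand-in for a model: errs once on build order, then self-corrects."""
    placed = {m.name for m in state}
    if "p1" not in placed:
        if feedback is None:
            # Deliberate mistake: a stud placed before its supporting plate.
            return Member("s1", "stud", supported_by="p1")
        return Member("p1", "plate")
    if "s1" not in placed:
        return Member("s1", "stud", supported_by="p1")
    return None


final_state, history = run_episode(scripted_policy)
```

In this toy run the first action is rejected because the stud's support does not exist yet, the policy then places the plate, and the stud succeeds on the retry. Keeping the per-step `history` alongside the final state is what would allow planning and self-correction to be scored separately from the finished structure.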

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Code availability is flagged in the production record; the public repository…

WHY NOW

Physical Generative Reasoning moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints.

Evidence 0 refs | 0 sources | 17% coverage

Blocker no shell-level blocker reported

Analysis summary

A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints.


Competitive landscape

A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints.

Segment

Physical Generative Reasoning

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/how-far-are-vision-language-models-from-constructing-the-real-world-a-benchmark-for-physical-generative-reasoning#webpage", "url": "https://sciencetostartup.com/paper/how-far-are-vision-language-models-from-constructing-the-real-world-a-benchmark-for-physical-generative-reasoning", "name": "How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning", "description": "A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/how-far-are-vision-language-models-from-constructing-the-real-world-a-benchmark-for-physical-generative-reasoning#scholarlyArticle", "headline": "How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning", "description": "A new benchmark and dataset for evaluating vision-language models' ability to reason about and construct physical objects based on structural and code compliance constraints.", "url": "https://sciencetostartup.com/paper/how-far-are-vision-language-models-from-constructing-the-real-world-a-benchmark-for-physical-generative-reasoning", "sameAs": "https://arxiv.org/abs/2603.24866", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.24866" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-25T23:13:28.000Z", "author": [ { "@type": "Person", "name": "Luyu Yang" }, { "@type": "Person", "name": "Yutong Dai" }, { "@type": "Person", "name": "An Yan" }, { "@type": "Person", "name": "Viraj Prabhu" }, { "@type": "Person", "name": "Ran Xu" }, { "@type": "Person", "name": "Zeyuan Chen" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Physical Generative Reasoning" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Physical Generative Reasoning", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "How Far Are Vision-Language Models from Constructing the Rea", "item": "https://sciencetostartup.com/paper/how-far-are-vision-language-models-from-constructing-the-real-world-a-benchmark-for-physical-generative-reasoning" } ] } ] }



