ARXIV:2603.13185 · SCENE GRAPH GENERATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: 3 of 4 citation fields filled (Partial) | Missing fields: authors (Missing) | Proof: unverified proof status (Partial)

Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

arXiv

A novel approach to generating spatio-temporal scene graphs from monocular videos, enhancing object interaction modeling.

Blocked on Code | Score 4.0 | Evidence unverified

Opportunity summary

Pain A novel approach to generating spatio-temporal scene graphs from monocular videos, enhancing object interaction modeling.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

A novel approach to generating spatio-temporal scene graphs from monocular videos, enhancing object interaction modeling. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward…

METHOD

Full abstract

Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.
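The abstract describes PWG as implementing object permanence via a zero-order feature buffer. A minimal sketch of that idea follows, assuming "zero-order" means a zero-order hold: each object's last observed feature is carried forward unchanged while the object is occluded or out of view. This is an illustration, not the authors' implementation; `ObjectState`, `ZeroOrderBuffer`, and the toy feature vectors are hypothetical names and values.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Feature = List[float]


@dataclass
class ObjectState:
    feature: Feature   # most recent feature vector for this object
    last_seen: int     # timestamp of the last direct observation
    observed: bool     # whether the object was detected at the current timestamp


@dataclass
class ZeroOrderBuffer:
    """Hypothetical buffer: holds each object's last feature until it is seen again."""
    states: Dict[str, ObjectState] = field(default_factory=dict)

    def update(self, t: int, detections: Dict[str, Feature]) -> Dict[str, ObjectState]:
        # Objects missing from this frame stay in the graph, flagged as unobserved.
        for state in self.states.values():
            state.observed = False
        # Fresh detections overwrite the held feature (zero-order hold).
        for obj_id, feat in detections.items():
            self.states[obj_id] = ObjectState(list(feat), t, True)
        return self.states


buf = ZeroOrderBuffer()
buf.update(0, {"person": [0.0, 1.0], "cup": [1.0, 0.0]})
nodes = buf.update(1, {"person": [0.5, 0.5]})  # the cup is occluded at t=1
for obj_id, s in nodes.items():
    status = "observed" if s.observed else f"held since t={s.last_seen}"
    print(obj_id, status, s.feature)
```

At each timestamp the returned dictionary plays the role of the node set of a world scene graph: it contains every object seen so far, both observed and unobserved. Per the abstract, 4DST replaces this kind of static buffer with differentiable per-object temporal attention.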

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

WHY NOW

Scene Graph Generation moved forward this cycle; last verified April 2026. Public score 4.0/10.

Opportunity summary

Score 4.0

Pain A novel approach to generating spatio-temporal scene graphs from monocular videos, enhancing object interaction modeling.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Analysis summary

A novel approach to generating spatio-temporal scene graphs from monocular videos, enhancing object interaction modeling.

Competitive landscape

A novel approach to generating spatio-temporal scene graphs from monocular videos, enhancing object interaction modeling.

Segment: Scene Graph Generation

Adoption evidence: No public code link in the paper record yet

Commercial read: 4.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published 2026-03-13T17:18:03.000Z | arXiv https://arxiv.org/abs/2603.13185 | Free to access

FAQ

What products could be built from this research?

Now is the time because advancements in 3D reconstruction, transformer models, and vision-language AI have made persistent scene reasoning more feasible, while industries like logistics and autonomous systems are pushing for higher safety standards and operational efficiency. The market is ripe for solutions that go beyond basic object detection to provide holistic, predictive insights in dynamic settings.

What are the practical use cases?

A real-time monitoring system for construction sites that uses monocular cameras to track workers, equipment, and materials in 4D, predicting potential hazards like collisions or falls even when objects are temporarily obscured by structures or machinery. This could alert supervisors to intervene before accidents occur, reducing injuries and downtime.
