Recent advances in spatial reasoning are reshaping how models interpret and interact with complex environments. A notable trend is the decoupling of perception from reasoning, which lets models leverage structured geometric representations, such as 3D scene graphs, to enhance spatial understanding. This shift yields significant performance improvements on tasks like map-to-street-view reasoning and embodied question answering, where traditional methods often falter. For instance, new frameworks like Chain-of-View prompting support dynamic viewpoint adjustments, allowing models to gather context from multiple angles and thereby improving accuracy on spatial tasks. Meanwhile, datasets designed to isolate spatial reasoning from visual input reveal that while models grasp basic spatial concepts, they struggle with more nuanced spatial relationships. These developments not only extend the capabilities of multimodal foundation models but also hold promise for robotics, navigation systems, and augmented reality, where robust spatial reasoning is critical for real-world interaction.
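The multi-viewpoint idea behind Chain-of-View prompting can be caricatured as a simple control loop: answer, check confidence, and request another view if needed. This is an illustrative sketch only; the callback names (`capture_view`, `answer_with_confidence`, `propose_next_viewpoint`) are hypothetical stand-ins, not the published method's API.

```python
def chain_of_view(question, env, max_views=5, threshold=0.9):
    """Gather viewpoints until the model's answer is confident enough.

    `env` is any object exposing the three hypothetical callbacks used
    below; this sketches the prompting loop, not the actual framework.
    """
    views = [env.capture_view()]                 # start from the initial viewpoint
    for _ in range(max_views - 1):
        answer, confidence = env.answer_with_confidence(question, views)
        if confidence >= threshold:              # stop once the answer is stable
            return answer
        # otherwise, pick a new viewpoint conditioned on the question so far
        views.append(env.capture_view(env.propose_next_viewpoint(question, views)))
    return env.answer_with_confidence(question, views)[0]


class MockEnv:
    """Toy environment whose confidence grows with the number of views."""
    def __init__(self):
        self.n = 0
    def capture_view(self, viewpoint=None):
        self.n += 1
        return f"view{self.n}"
    def answer_with_confidence(self, question, views):
        return ("left of the door", min(0.4 * len(views), 1.0))
    def propose_next_viewpoint(self, question, views):
        return "rotate"


print(chain_of_view("where is the lamp?", MockEnv()))  # → left of the door
```

With the mock above, confidence crosses the 0.9 threshold after three views, so the loop terminates early rather than exhausting `max_views`.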
Achieving robust spatial reasoning remains a fundamental challenge for current Multimodal Foundation Models (MFMs). Existing methods either overfit to statistical shortcuts via 3D grounding data or remai...
We introduce GSU, a text-only grid dataset for evaluating the spatial reasoning capabilities of LLMs across three core tasks: navigation, object localization, and structure composition. By forgoing visual inpu...
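A text-only grid task of this kind can be made concrete with a toy example: the model receives a symbolic grid and must reason about positions and paths with no visual input. The grid layout, symbols, and query below are illustrative inventions, not the actual GSU format.

```python
from collections import deque

# Toy grid: 'A' and 'B' are objects, '#' is an obstacle, '.' is open floor.
GRID = [
    "A . . # .",
    ". # . . .",
    ". . . # B",
]

def locate(grid):
    """Map each object symbol to its (row, col) coordinate."""
    return {ch: (r, c)
            for r, row in enumerate(grid)
            for c, ch in enumerate(row.split())
            if ch not in ".#"}

def shortest_path_len(grid, start, goal):
    """BFS over open cells; returns the step count, or None if unreachable."""
    cells = [row.split() for row in grid]
    rows, cols = len(cells), len(cells[0])
    pos = locate(grid)
    queue = deque([(pos[start], 0)])
    seen = {pos[start]}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == pos[goal]:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and cells[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

# A navigation query over the symbolic grid:
print(shortest_path_len(GRID, "A", "B"))  # → 6
```

The point of such a setup is that the ground-truth answer is computable from the text alone, so any failure isolates the model's spatial reasoning rather than its perception.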
Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egoce...
Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-e...
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language mode...