ARXIV:2601.19099 · SPATIAL REASONING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: 3 of 4 citation fields filled (Partial) | Missing fields: authors (Missing) | Proof: unverified proof status (Partial)

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

arXiv

Launch a benchmark for testing and improving spatial reasoning of vision-language models using map-to-street-view tasks.

Blocked on Code | Score 4.0 | Evidence unverified

Opportunity summary

Pain Launch a benchmark for testing and improving spatial reasoning of vision-language models using map-to-street-view tasks.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

Launch a benchmark for testing and improving spatial reasoning of vision-language models using map-to-street-view tasks. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by…

METHOD

Full abstract

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
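The task in the abstract (inferring a camera viewing direction relative to a north-up map) can be illustrated with a small scoring sketch. The abstract does not state the answer format or the evaluation protocol, so everything below is an assumption made for illustration: the 8-way compass vocabulary, the hypothetical helper names `heading_to_label`, `angular_error`, and `accuracy`, and the exact-match scoring rule.

```python
from typing import Sequence

# Assumed 8-way compass vocabulary (hypothetical; the paper's label set is not given here).
COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def heading_to_label(heading_deg: float) -> str:
    """Map a heading in degrees (0 = north, increasing clockwise) to the nearest compass label."""
    sector = 360 / len(COMPASS)  # 45 degrees per label
    # Shift by half a sector so each label is centred on its nominal heading.
    index = int(((heading_deg % 360) + sector / 2) // sector) % len(COMPASS)
    return COMPASS[index]


def angular_error(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360
    return min(diff, 360 - diff)


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of exact label matches (assumed scoring rule)."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("cannot score an empty set")
    hits = sum(p == t for p, t in zip(predicted, truth))
    return hits / len(truth)


if __name__ == "__main__":
    print(heading_to_label(350))   # N
    print(angular_error(350, 10))  # 20
    print(accuracy(["N", "E", "S", "W"], ["N", "E", "S", "E"]))  # 0.75
```

Under the same exact-match reading, the figures reported in the abstract put the best evaluated VLM 29.8 percentage points below the human baseline (95% versus 65.2%).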

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with…

WHY NOW

Spatial Reasoning moved forward this cycle; last verified April 2026. Public score 4.0/10.

Opportunity summary

Score 4.0

Pain Launch a benchmark for testing and improving spatial reasoning of vision-language models using map-to-street-view tasks.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Competitive landscape

Segment: Spatial Reasoning

Adoption evidence: No public code link in the paper record yet

Commercial read: 4.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-01-27 · arXiv: https://arxiv.org/abs/2601.19099
