ARXIV:2603.28069 · VISION-LANGUAGE MODELS · SUBMITTED 31 MAR · 20:53 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Christopher Clark · Yue Yang · Jae Sung Park · Zixian Ma · Jieyu Zhang · Rohun Tripathi · +5 at arXiv

A new pointing mechanism for VLMs that uses special grounding tokens to directly select visual tokens instead of generating text coordinates, improving sample efficiency and setting a new state of the art on image pointing, with further gains on GUI and video pointing tasks.

Blocked on Code · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A new pointing mechanism for VLMs that uses special grounding tokens to directly select visual tokens instead of generating text coordinates, improving sample efficiency and setting a new state of the art on image pointing, with further gains on GUI and video pointing tasks.

Evidence 104 refs | 5 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count.

METHOD

Full abstract

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
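The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that a pointing token cross-attends to the visual tokens and selects one, so the scaled dot-product scoring, the function names (`select_token`, `point`, `to_pixel`), and the `patch_size` and `subgrid` values are all assumptions made for illustration. Sequential ordering and relative-position encoding of previous points are not modeled here.

```python
import numpy as np


def select_token(query, keys):
    """Score every key against the query and return (best index, softmax scores).

    Assumption: scaled dot-product scoring stands in for the paper's
    cross-attention, whose exact form the abstract does not specify.
    """
    logits = keys @ query / np.sqrt(query.shape[0])
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs


def point(patch_query, visual_keys, no_more_points_key, grid_width):
    """Pick one visual token, or return None for the no-more-points class.

    The extra key is appended after the visual tokens, so selecting the last
    index means "stop pointing" (hypothetical layout).
    """
    keys = np.vstack([visual_keys, no_more_points_key[None, :]])
    idx, _ = select_token(patch_query, keys)
    if idx == visual_keys.shape[0]:
        return None
    return divmod(idx, grid_width)  # (row, col) of the selected patch


def to_pixel(patch_rc, subpatch_rc, offset_xy, patch_size=14, subgrid=2):
    """Combine the three coarse-to-fine choices into pixel coordinates.

    patch_size and subgrid are illustrative values, not taken from the paper.
    offset_xy is a location inside the subpatch, each component in [0, 1).
    """
    row, col = patch_rc
    sub_row, sub_col = subpatch_rc
    sub_size = patch_size / subgrid
    x = col * patch_size + sub_col * sub_size + offset_xy[0] * sub_size
    y = row * patch_size + sub_row * sub_size + offset_xy[1] * sub_size
    return x, y


# Toy usage: 12 visual tokens on a 3x4 grid, with one-hot keys for clarity.
visual_keys = np.eye(12, 16)
stop_key = np.eye(16)[12]
query = 10.0 * np.eye(16)[5]  # a query aligned with visual token 5
patch = point(query, visual_keys, stop_key, grid_width=4)
pixel = to_pixel(patch, subpatch_rc=(1, 0), offset_xy=(0.5, 0.5))
```

In this sketch a point costs three selection steps (patch, subpatch, location within the subpatch) rather than a string of coordinate digits, which matches the abstract's motivation of avoiding a learned coordinate system and a high token count.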

RESULT

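ScienceToStartup currently rates this 7.0/10 on the public viability pass. The abstract reports a new state of the art on image pointing (70.7% on PointBench), a new state of the art among fully open models on GUI pointing (61.1% on ScreenSpotPro), a 59.1% human preference win rate over a text coordinate baseline on video pointing, and a +6.3% gain on Molmo2Track for tracking.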

WHY NOW

Vision-Language Models moved forward this cycle; last verified April 2026. Public score 7.0/10.

Blocker: no shell-level blocker reported

Competitive landscape

Segment: Vision-Language Models

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper record

Title: MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

arXiv: https://arxiv.org/abs/2603.28069

Published: 2026-03-30T06:15:06.000Z

Research domain: Vision-Language Models

Viability score: 7

Free to access: yes

Page: https://sciencetostartup.com/paper/molmopoint-better-pointing-for-vlms-with-grounding-tokens
