ARXIV:2601.10611 · MULTIMODAL VISION-LANGUAGE MODELS · SUBMITTED 17 MAR · 21:43 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: 3 of 4 citation fields filled (Partial) | Missing fields: authors (Missing) | Proof: unverified proof status (Partial)

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

arXiv

Open-source video-language models with state-of-the-art video grounding capabilities for applications in security, video search, and assistive technology.

Blocked on Code › Score 8.0 | Evidence unverified

Opportunity summary

Pain Open-source video-language models with state-of-the-art video grounding capabilities for applications in security, video search, and assistive technology.

Evidence 0 refs | 0 sources | 33% coverage

Blocker Evidence unverified

PROBLEM

Open-source video-language models with state-of-the-art video grounding capabilities for applications in security, video search, and assistive technology. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or…

METHOD

Full abstract

Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
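
The abstract mentions bi-directional attention on vision tokens without giving details. As a rough illustration only, the sketch below builds an attention mask in which text tokens stay causal while vision tokens may also attend to later vision tokens. The function name `build_attention_mask`, the `is_vision` flag list, and the choice to let every vision token see every other vision token (rather than, say, only tokens of the same frame) are assumptions made for this example, not details confirmed by the paper.

```python
from typing import List


def build_attention_mask(is_vision: List[bool]) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Hypothetical sketch: every position attends causally (k <= q), and any
    vision position may additionally attend to any other vision position,
    including later ones. How the real model scopes this is not stated in
    the abstract.
    """
    n = len(is_vision)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q                               # standard decoder rule
            both_vision = is_vision[q] and is_vision[k]   # assumed bi-directional rule
            mask[q][k] = causal or both_vision
    return mask


if __name__ == "__main__":
    # Two vision tokens followed by two text tokens.
    for row in build_attention_mask([True, True, False, False]):
        print("".join("1" if allowed else "0" for allowed in row))
```

With two vision tokens followed by two text tokens, the first vision token can see the second one, while each text token still sees only itself and earlier positions.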

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. Per the abstract, the 8B model outperforms other open weight and data models on short videos, counting, and captioning, and is competitive on long videos.

WHY NOW

Multimodal Vision-Language Models moved forward this cycle; last verified April 2026. Public score 8.0/10.

Opportunity summary

Score 8.0

Pain Open-source video-language models with state-of-the-art video grounding capabilities for applications in security, video search, and assistive technology.

Evidence 0 refs | 0 sources | 33% coverage

Blocker missing authors

Competitive landscape

Open-source video-language models with state-of-the-art video grounding capabilities for applications in security, video search, and assistive technology.

Segment: Multimodal Vision-Language Models

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Authors (from the page's structured data): Christopher Clark (Allen Institute for AI), Jieyu Zhang (University of Washington), Ranjay Krishna (University of Washington), Ali Farhadi (University of Washington), Zixian Ma (University of Washington)

Published: 2026-01-15 17:27 UTC

arXiv: https://arxiv.org/abs/2601.10611

Research domain: Multimodal Vision-Language Models

FAQ

What is the startup potential of "Molmo2: Open Weights and Data for Vision-Language Models wit"?

Open-source video-language models with state-of-the-art video grounding capabilities for applications in security, video search, and assistive technology.

What products could be built from this research?

Productize by creating a platform that integrates Molmo2 for end-users who need enhanced video understanding and event tracking capabilities. This could be packaged as an API for easy integration into existing video systems or as a standalone application.

What are the practical use cases?

A real-time video analysis tool for security systems that utilizes Molmo2's models to provide precise event detection and description, enhancing surveillance efficiency and accuracy.

What industries could this research disrupt?

Molmo2 has the potential to replace proprietary video-language models by offering similar or better performance while being fully open-source, thus lowering the entry barrier for businesses and developers.
