ARXIV:2603.09206 · MULTIMODAL AI · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Partial · PaperPack: 3 of 4 citation fields filled
Missing · Missing fields: authors
Partial · Proof: unverified proof status

MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

arXiv

MM-Zero is a self-evolving framework for Vision Language Models that enhances reasoning capabilities without the need for data.

Blocked on Code · Score 4.0 · Evidence unverified

Opportunity summary

Pain MM-Zero is a self-evolving framework for Vision Language Models that enhances reasoning capabilities without the need for data.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

MM-Zero is a self-evolving framework for Vision Language Models that enhances reasoning capabilities without the need for data. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no…

METHOD

Full abstract

Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
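The abstract names GRPO and a reward that combines execution feedback, visual verification, and difficulty balancing, but this page gives no formulas. The sketch below is only an illustration under stated assumptions: `group_relative_advantages` follows the commonly published GRPO normalization (reward minus group mean, divided by group standard deviation; population standard deviation is assumed here), while `coder_reward`, its equal weights, and its difficulty term peaking at 50% Solver accuracy are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style).

    Each sample's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def coder_reward(executed, visually_verified, solver_accuracy):
    """Hypothetical composite reward for one rendered sample.

    ASSUMPTION: the weights and functional form are illustrative only.
    - execution feedback: code that fails to run earns nothing
    - visual verification: bonus if the image matches the concept
    - difficulty balancing: highest when the Solver is right about
      half the time, lowest when the task is trivial or impossible
    """
    if not executed:
        return 0.0
    verification = 1.0 if visually_verified else 0.0
    difficulty = 1.0 - abs(2.0 * solver_accuracy - 1.0)
    return 0.5 * verification + 0.5 * difficulty


if __name__ == "__main__":
    # One group of four sampled programs for the same proposed concept.
    group = [
        coder_reward(True, True, 0.5),    # runs, verified, balanced
        coder_reward(True, True, 1.0),    # runs, verified, too easy
        coder_reward(True, False, 0.5),   # runs, not verified
        coder_reward(False, False, 0.0),  # fails to execute
    ]
    for reward, advantage in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because advantages are computed relative to the group, a sample is reinforced only when it beats its siblings; a group in which every sample earns the same reward produces zero advantage for all of them.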

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning.

WHY NOW

Multimodal AI moved forward this cycle; last verified April 2026. Public score 4.0/10.


Opportunity summary

Score 4.0

Pain MM-Zero is a self-evolving framework for Vision Language Models that enhances reasoning capabilities without the need for data.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors


Competitive landscape

MM-Zero is a self-evolving framework for Vision Language Models that enhances reasoning capabilities without the need for data.

Segment

Multimodal AI

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


Published: 2026-03-10T05:23:26Z · arXiv: https://arxiv.org/abs/2603.09206 · Source page: https://sciencetostartup.com/paper/mm-zero-self-evolving-multi-model-vision-language-models-from-zero-data

