ARXIV:2602.12566 · REINFORCEMENT LEARNING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Partial · PaperPack: 3 of 4 citation fields filled
Missing · Missing fields: authors
Partial · Proof: unverified proof status

To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

arXiv

M2RL enhances multi-domain reasoning in LLMs through Reinforcement Learning with Verifiable Rewards.

Blocked on Code · Score 5.0 · Evidence unverified

Opportunity summary

Pain M2RL enhances multi-domain reasoning in LLMs through Reinforcement Learning with Verifiable Rewards.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

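M2RL enhances multi-domain reasoning in LLMs through Reinforcement Learning with Verifiable Rewards (RLVR). RLVR can reach expert-level performance in individual domains such as coding or math, but the two prevailing multi-domain paradigms, mixed multi-task RLVR and separate RLVR followed by model merging, have not been compared in detail.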
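
To make "verifiable rewards" concrete, here is a minimal sketch of a rule-based reward for a math-style task. The function names `extract_boxed` and `math_reward`, the `\boxed{...}` answer convention, and the binary 0/1 scoring are illustrative assumptions, not details taken from the paper.

```python
import re
from typing import Optional

# Hypothetical convention: the final answer is wrapped in \boxed{...}.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = _BOXED.findall(response)
    return matches[-1].strip() if matches else None


def math_reward(response: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference, else 0.0.

    Comparison ignores spaces only; a real checker would likely need
    symbolic or numeric equivalence (assumption, not from the paper).
    """
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    same = answer.replace(" ", "") == reference.replace(" ", "")
    return 1.0 if same else 0.0
```

The reward is "verifiable" in the sense that it is computed by a deterministic check against a known reference rather than by a learned reward model. Coding or instruction-following domains would need their own checkers, such as unit tests or format rules.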

METHOD

Full abstract

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). We can achieve expert-level performance in some specific domains via RLVR, such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider the collaboration of RLVR across different domains. The current state-of-the-art models mainly employ two different training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most of the works did not provide a detailed comparison and analysis about these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find the RLVR across domains exhibits few mutual interferences, and reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. This project is named as M2RL that means Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at https://github.com/mosAI25/M2RL
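
The "separate RLVR followed by model merging" paradigm and the weight-space view mentioned in the abstract can be sketched as follows. The abstract does not say which merging method is used, so this example assumes simple task-vector averaging: each expert's delta from a shared base is averaged and added back. The function names, the flat `Dict[str, List[float]]` weight format, and the toy numbers are all hypothetical.

```python
import math
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(expert: Weights, base: Weights) -> Weights:
    """Per-parameter difference between an RL-trained expert and the base."""
    return {k: [e - b for e, b in zip(expert[k], base[k])] for k in base}


def merge_experts(base: Weights, experts: List[Weights],
                  scale: float = 1.0) -> Weights:
    """Add the scaled mean of the experts' task vectors to the base weights."""
    deltas = [task_vector(e, base) for e in experts]
    merged: Weights = {}
    for k in base:
        mean_delta = [
            sum(d[k][i] for d in deltas) / len(deltas)
            for i in range(len(base[k]))
        ]
        merged[k] = [b + scale * m for b, m in zip(base[k], mean_delta)]
    return merged


def cosine(u: Weights, v: Weights) -> float:
    """Cosine similarity between two task vectors over all parameters."""
    dot = sum(a * b for k in u for a, b in zip(u[k], v[k]))
    norm_u = math.sqrt(sum(a * a for k in u for a in u[k]))
    norm_v = math.sqrt(sum(b * b for k in v for b in v[k]))
    return dot / (norm_u * norm_v)


# Toy example with made-up numbers (not results from the paper).
base = {"w": [0.0, 0.0]}
math_expert = {"w": [1.0, 0.0]}
code_expert = {"w": [1.0, 1.0]}

merged = merge_experts(base, [math_expert, code_expert])
similarity = cosine(task_vector(math_expert, base),
                    task_vector(code_expert, base))
```

A cosine near zero would suggest the two experts moved the weights in nearly orthogonal directions, while a clearly positive value would suggest shared update directions. This is one common way to probe interference in weight space; whether the paper uses this exact measure is not stated in the abstract. The mixed multi-task paradigm, by contrast, would train a single model on batches drawn from all domains and needs no merge step.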

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. The abstract reports that RLVR across domains exhibits few mutual interferences and that reasoning-intensive domains are mutually synergistic.

WHY NOW

Reinforcement Learning moved forward this cycle; last verified April 2026. Public score 5.0/10.

Opportunity summary

Score 5.0

Pain M2RL enhances multi-domain reasoning in LLMs through Reinforcement Learning with Verifiable Rewards.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Analysis summary

M2RL enhances multi-domain reasoning in LLMs through Reinforcement Learning with Verifiable Rewards.

Competitive landscape

M2RL enhances multi-domain reasoning in LLMs through Reinforcement Learning with Verifiable Rewards.

Segment

Reinforcement Learning

Adoption evidence

No public code link in the paper record yet; the abstract lists a project homepage at https://github.com/mosAI25/M2RL

Commercial read

5.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

arXiv: https://arxiv.org/abs/2602.12566 · Published 2026-02-13 · Free to access
