ARXIV:2602.05547 · LLM TRAINING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: 3 of 4 citation fields filled (Partial) · Missing fields: authors (Missing) · Proof: unverified proof status (Partial)

Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

arXiv

Improve LLM task adaptability and efficiency with Multi-Task GRPO, enhancing performance across diverse tasks through dynamic task weighting and efficient training.

Blocked on Code · Score 4.0 · Evidence unverified

Opportunity summary

Pain Improve LLM task adaptability and efficiency with Multi-Task GRPO, enhancing performance across diverse tasks through dynamic task weighting and efficient training.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

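RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks, but real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Tasks also vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal.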
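The zero-advantage issue mentioned in the abstract can be made concrete with a small sketch. GRPO scores each sampled response relative to the other responses for the same prompt, so a prompt whose responses all receive the same reward contributes no gradient. The reward values and the task names `easy_task` and `hard_task` below are made up for illustration, and the normalization shown is the commonly used mean/std form, not necessarily the exact variant used in the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def zero_advantage_rate(groups):
    """Fraction of prompts whose sampled responses all got the same reward."""
    zero = sum(1 for g in groups if max(g) == min(g))
    return zero / len(groups)

# Hypothetical 0/1 rewards: four prompts per task, four responses per prompt.
easy_task = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 1]]
hard_task = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [0, 1, 1, 0]]

print(group_advantages([1, 1, 1, 1]))   # every advantage is 0.0
print(zero_advantage_rate(easy_task))   # 0.75
print(zero_advantage_rate(hard_task))   # 0.25
```

In this toy data both tasks supply the same number of prompts, yet the nearly solved task delivers a usable gradient on only one prompt in four. That gap between nominal and effective contribution is the distortion the abstract refers to.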

METHOD

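MT-GRPO extends GRPO to the multi-task setting with two components: (i) task weights that are adapted dynamically to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) a ratio-preserving sampler that ensures task-wise policy gradients reflect the adapted weights.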
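The abstract describes task weights that are adapted dynamically to optimize worst-task performance, but this page does not give the update rule. As a rough illustration only, the sketch below uses a generic multiplicative-weights update in which lower-accuracy tasks gain weight. The function `update_task_weights`, the `step_size` parameter, the task names, and the accuracy values are all assumptions made for this example, not the paper's method.

```python
import math

def update_task_weights(weights, accuracies, step_size=2.0):
    """Multiplicative-weights step: lower accuracy means a larger new weight.

    Illustrative stand-in only; the paper's actual update rule is not
    given on this page.
    """
    scaled = {t: weights[t] * math.exp(-step_size * accuracies[t]) for t in weights}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}

# Hypothetical tasks and validation accuracies.
weights = {"math": 1 / 3, "code": 1 / 3, "logic": 1 / 3}
accuracies = {"math": 0.70, "code": 0.55, "logic": 0.20}

new_weights = update_task_weights(weights, accuracies)
for task, w in new_weights.items():
    print(f"{task}: {w:.3f}")
```

Repeating such a step shifts weight toward whichever task is currently worst, which is the qualitative behavior the abstract attributes to MT-GRPO. A larger `step_size` concentrates weight on the worst task more aggressively.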

Full abstract

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
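One way to read the "ratio-preserving sampler" in the abstract is that the share of gradient-carrying prompts per task should match the adapted weights, even when some tasks often yield zero advantages. The sketch below implements that reading by resampling until each task's quota is filled with informative prompt groups. This is an interpretation, not the paper's algorithm; `task_quotas`, `ratio_preserving_batch`, `draw_group`, `max_draws`, and the simulated success rates are hypothetical.

```python
import random

def task_quotas(weights, batch_size):
    """Split batch_size across tasks in proportion to weights (largest remainder)."""
    raw = {t: w * batch_size for t, w in weights.items()}
    quotas = {t: int(v) for t, v in raw.items()}
    leftover = batch_size - sum(quotas.values())
    by_remainder = sorted(raw, key=lambda t: raw[t] - quotas[t], reverse=True)
    for t in by_remainder[:leftover]:
        quotas[t] += 1
    return quotas

def ratio_preserving_batch(weights, batch_size, draw_group, max_draws=10_000):
    """Draw prompt groups per task until each quota holds only informative groups.

    draw_group(task) returns the rewards of one prompt's sampled responses.
    Groups with identical rewards (zero advantage) are skipped, not counted.
    """
    batch = []
    for task, quota in task_quotas(weights, batch_size).items():
        kept = draws = 0
        while kept < quota and draws < max_draws:
            rewards = draw_group(task)
            draws += 1
            if max(rewards) != min(rewards):
                batch.append((task, rewards))
                kept += 1
    return batch

# Simulated rollouts with hypothetical per-response success rates.
rng = random.Random(0)
solve_prob = {"easy": 0.95, "hard": 0.4}

def draw_group(task, group_size=4):
    return [1 if rng.random() < solve_prob[task] else 0 for _ in range(group_size)]

batch = ratio_preserving_batch({"easy": 0.25, "hard": 0.75}, 8, draw_group)
counts = {t: sum(1 for task, _ in batch if task == t) for t in solve_prob}
print(counts)
```

Without the resampling loop, the "easy" task here would usually fall short of its intended share of the gradient signal, because most of its prompt groups have identical rewards. The cost of this approach is extra rollouts on tasks with high zero-advantage rates.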

RESULT

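Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy, with 16-28% and 6% absolute improvement over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. In the 3-task setting, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy. ScienceToStartup currently rates this 4.0/10 on the public viability pass.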

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 4.0/10.


Opportunity summary

Score 4.0

Pain Improve LLM task adaptability and efficiency with Multi-Task GRPO, enhancing performance across diverse tasks through dynamic task weighting and efficient training.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors


Competitive landscape

Improve LLM task adaptability and efficiency with Multi-Task GRPO, enhancing performance across diverse tasks through dynamic task weighting and efficient training.

Segment: LLM Training

Adoption evidence: No public code link in the paper record yet

Commercial read: 4.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Source: https://arxiv.org/abs/2602.05547 (arXiv:2602.05547) · Published 2026-02-05 · Research domain: LLM Training

