ARXIV:2604.02288 · LLM TRAINING · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Verified · Source: PDF linked · Verified · PaperPack: citation fields available · Partial · Proof: unverified proof status

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Gengsheng Li · Tianyu Yang · Junfeng Fang · Mingyang Song · Mao Zheng · Haiyun Guo · Dan Zhang · Jinqiao Wang · Tat-Seng Chua

A unified reinforcement learning framework for large language models that improves training stability and efficiency by intelligently routing samples to different optimization strategies.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A unified reinforcement learning framework for large language models that improves training stability and efficiency by intelligently routing samples to different optimization strategies.

Evidence: 0 refs | 0 sources | 33% coverage

Blocker: Evidence unverified

PROBLEM

A unified reinforcement learning framework for large language models that improves training stability and efficiency by intelligently routing samples to different optimization strategies. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse…

METHOD

Full abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
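The routing and entropy-aware weighting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the binary reward threshold, and the normalized-entropy weight formula (1 minus entropy divided by maximum entropy) are all assumptions made for illustration; the abstract does not specify the exact weighting function or how the two losses are combined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_samples(rewards, correct_threshold=1.0):
    """Split rollout indices by verifiable reward.

    Assumption: a reward at or above `correct_threshold` marks a correct
    rollout. Correct rollouts go to the GRPO-style update, failed rollouts
    go to the SDPO-style distillation update.
    """
    grpo_idx = [i for i, r in enumerate(rewards) if r >= correct_threshold]
    sdpo_idx = [i for i, r in enumerate(rewards) if r < correct_threshold]
    return grpo_idx, sdpo_idx


def entropy_weight(teacher_logits):
    """Hypothetical per-token weight in [0, 1] for a distillation target.

    A confident (low-entropy) teacher distribution gets a weight near 1,
    a uniform (maximum-entropy) one gets a weight of 0. This normalized
    form is an illustrative choice, not the formula from the paper.
    """
    probs = softmax(teacher_logits)
    max_entropy = math.log(len(probs))
    if max_entropy == 0.0:  # single-token vocabulary: nothing to suppress
        return 1.0
    return 1.0 - entropy(probs) / max_entropy


if __name__ == "__main__":
    # Four rollouts for one prompt with binary verifiable rewards.
    rewards = [1.0, 0.0, 1.0, 0.0]
    grpo_idx, sdpo_idx = route_samples(rewards)
    print("GRPO branch:", grpo_idx)
    print("SDPO branch:", sdpo_idx)

    # Per-token teacher logits for one failed rollout (toy 3-token vocabulary).
    teacher_logits_per_token = [[10.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
    weights = [entropy_weight(t) for t in teacher_logits_per_token]
    print("distillation weights:", weights)
```

In this sketch the confident first token receives a much larger weight than the near-uniform second token, which mirrors the abstract's stated goal of suppressing high-entropy, unreliable distillation targets while emphasizing confident ones.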

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. Code…

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Competitive landscape

Segment: LLM Training

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/unifying-group-relative-and-self-distillation-policy-optimization-via-sample-routing#webpage", "url": "https://sciencetostartup.com/paper/unifying-group-relative-and-self-distillation-policy-optimization-via-sample-routing", "name": "Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing", "description": "A unified reinforcement learning framework for large language models that improves training stability and efficiency by intelligently routing samples to different optimization strategies.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/unifying-group-relative-and-self-distillation-policy-optimization-via-sample-routing#scholarlyArticle", "headline": "Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing", "description": "A unified reinforcement learning framework for large language models that improves training stability and efficiency by intelligently routing samples to different optimization strategies.", "url": "https://sciencetostartup.com/paper/unifying-group-relative-and-self-distillation-policy-optimization-via-sample-routing", "sameAs": "https://arxiv.org/abs/2604.02288", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.02288" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-04-02T17:29:18.000Z", "author": [ { "@type": "Person", "name": "Gengsheng Li" }, { "@type": "Person", "name": "Tianyu Yang" }, { "@type": "Person", "name": "Junfeng Fang" }, { "@type": "Person", "name": "Mingyang Song" }, { "@type": "Person", "name": "Mao Zheng" }, { "@type": "Person", "name": "Haiyun Guo" }, { "@type": "Person", "name": "Dan Zhang" }, { "@type": "Person", "name": "Jinqiao Wang" }, { "@type": "Person", 
"name": "Tat-Seng Chua" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "LLM Training" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "LLM Training", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "Unifying Group-Relative and Self-Distillation Policy Optimiz", "item": "https://sciencetostartup.com/paper/unifying-group-relative-and-self-distillation-policy-optimization-via-sample-routing" } ] } ] }
