ARXIV:2603.26535 · LLM TRAINING · SUBMITTED 30 MAR · 21:52 UTC · FRESHNESS STALE

Source: PDF linked (verified) · PaperPack: citation fields available (verified) · Proof: unverified proof status (partial)

Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

Zelin Tan · Zhouliang Yu · Bohan Lin · Zijie Geng · Hejia Geng · Yudong Zhang · +6 at arXiv

A novel training method that improves LLM reasoning quality by decoupling outcome and process rewards, outperforming existing approaches on complex benchmarks.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A novel training method that improves LLM reasoning quality by decoupling outcome and process rewards, outperforming existing approaches on complex benchmarks.

Evidence 21 refs | 4 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

A novel training method that improves LLM reasoning quality by decoupling outcome and process rewards, outperforming existing approaches on complex benchmarks. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically…

METHOD

Full abstract

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component A_out, derived from ORM and normalized over all responses, and a process component A_proc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that A_out anchors training on correctness while A_proc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
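The decoupled normalization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a GRPO-style z-score, (x - mean) / (std + eps), a simple sum of the outcome and process components, and a process term of zero for incorrect responses. The abstract states none of these details, and the names `zscore`, `decoupled_advantages`, and `EPS` are hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # assumed numerical stabilizer, not taken from the paper


def zscore(values):
    """Group-relative normalization: (x - mean) / (std + eps)."""
    if not values:
        return []
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def decoupled_advantages(correct, rubric_scores):
    """Return (a_out, a_proc, total) for one group of sampled responses.

    correct       -- list of bools, the outcome (ORM) verdict per response
    rubric_scores -- list of floats, the rubric-based PRM score per response
    """
    # Outcome component: binary reward normalized over ALL responses.
    a_out = zscore([1.0 if c else 0.0 for c in correct])

    # Process component: rubric score normalized ONLY among correct responses.
    idx = [i for i, c in enumerate(correct) if c]
    normed = zscore([rubric_scores[i] for i in idx])
    a_proc = [0.0] * len(correct)  # assumption: incorrect responses get no process term
    for i, z in zip(idx, normed):
        a_proc[i] = z

    # Assumption: the two components are combined by a plain sum.
    total = [o + p for o, p in zip(a_out, a_proc)]
    return a_out, a_proc, total


if __name__ == "__main__":
    # Mixed group: the incorrect response with a high rubric score (0.8)
    # receives no process credit, so the rubric cannot outweigh correctness.
    print(decoupled_advantages([True, True, False, False], [0.9, 0.5, 0.8, 0.1]))

    # Uniformly correct group: the outcome component is all zeros,
    # but the process component still ranks the responses.
    print(decoupled_advantages([True, True, True], [0.2, 0.5, 0.8]))
```

Under these assumptions, the second call shows the case the abstract highlights: when every response in a group is correct, the outcome-only signal vanishes while the process component still separates stronger from weaker reasoning.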

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve…

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain A novel training method that improves LLM reasoning quality by decoupling outcome and process rewards, outperforming existing approaches on complex benchmarks.

Evidence 21 refs | 4 sources | 50% coverage

Blocker no shell-level blocker reported


Competitive landscape

A novel training method that improves LLM reasoning quality by decoupling outcome and process rewards, outperforming existing approaches on complex benchmarks.

Segment

LLM Training

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


Published 2026-03-27 15:48 UTC · arXiv:2603.26535 · https://arxiv.org/abs/2603.26535

Authors: Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang, Mulei Zhang, Yang Chen, Shuyue Hu, Zhenfei Yin, Chen Zhang, Lei Bai

