ARXIV:2601.05242 · REINFORCEMENT LEARNING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Partial · PaperPack: 3 of 4 citation fields filled | Missing · Missing fields: authors | Partial · Proof: unverified proof status

Multi-Reward RL Optimization: GDPO for Language Models

arXiv

Optimize multi-reward reinforcement learning with GDPO for stable and precise model training.

Blocked on Code | Score 4.0 | Evidence unverified

Opportunity summary

Pain Optimize multi-reward reinforcement learning with GDPO for stable and precise model training.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these…

METHOD

Full abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
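The collapse described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes the GRPO-style baseline sums each rollout's rewards and standardizes the totals within a group, while the decoupled variant standardizes each reward separately within the group and then sums. The function names, the eps constant, and the toy reward values are all hypothetical.

```python
from math import sqrt

def group_normalize(values, eps=1e-6):
    """Standardize values within one rollout group (population std; eps is an assumed stabilizer)."""
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]

def coupled_advantages(rewards):
    """GRPO-style sketch: sum each rollout's rewards, then normalize the totals."""
    return group_normalize([sum(r) for r in rewards])

def decoupled_advantages(rewards):
    """GDPO-style sketch: normalize each reward across the group, then sum per rollout."""
    per_reward = [group_normalize(list(col)) for col in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]

# Two toy groups of two rollouts; each rollout is scored by two binary rewards.
group_a = [(0, 0), (1, 0)]  # second rollout satisfies one reward
group_b = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

for name, group in (("A", group_a), ("B", group_b)):
    print(name, coupled_advantages(group), decoupled_advantages(group))
```

Under these assumptions the coupled version assigns both groups roughly (-1, 1), so the two reward combinations become indistinguishable. The decoupled version gives roughly (-1, 1) for group A and (-2, 2) for group B, so the rollout that satisfies both rewards keeps a stronger signal.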

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass.

WHY NOW

Reinforcement Learning moved forward this cycle; last verified April 2026. Public score 4.0/10.

Opportunity summary

Score 4.0

Pain Optimize multi-reward reinforcement learning with GDPO for stable and precise model training.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Analysis summary

Optimize multi-reward reinforcement learning with GDPO for stable and precise model training.

Paper Pack

10.48550/arXiv.2601.05242

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Source availability

PDF linked

The paper record includes a public PDF URL.

Extraction status

Parse run linked

A document parse run is attached to this paper.

Proof status

unverified

0 refs; 0 sources; 17% coverage.

What was readable

linked | on file | 20 anchors | derived fallback | 36 indexed | not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability

4.0

Time to MVP

MVP estimate missing

Commercial

No commercial flags on file

Claim map

Abstract-backed public claims while anchored extraction refreshes.

Strong 0 | Mixed 0 | Weak 4

Evidence (partial): Optimize multi-reward reinforcement learning with GDPO for stable and precise model training. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

Evidence (partial): As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

Evidence (partial): ScienceToStartup currently rates this 4.0/10 on the public viability pass. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

Evidence (partial): Reinforcement Learning moved forward this cycle; last verified April 2026. Public score 4.0/10.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

Constellation map

Paper-native neighborhood for concepts, methods, materials, markets, and competitors. Missing lanes stay labeled instead of disappearing behind commercialization gates.

Concepts

not indexed

Methods

Materials

PDF linked | Document parse run

Markets

Reinforcement Learning

Competitors

not indexed

Competitive landscape

Segment

Reinforcement Learning

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Buzz

No indexed public discussion is attached to 2601.05242 yet. That is a visibility signal, not a blank module: the monitor is watching the public channels below.

Hacker News

Not indexed yet

Bluesky

Not indexed yet

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

Prior Work | ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization | 4.0

Extension

Builds On This | MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following | 0.0

Commercially relevant

Higher Viability | Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing | 7.0
Higher Viability | iGRPO: Self-Feedback-Driven LLM Reasoning | 6.0
Higher Viability | Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning | 5.0
Higher Viability | MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning | 6.0
Higher Viability | Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks | 6.0
Higher Viability | Stabilizing Rubric Integration Training via Decoupled Advantage Normalization | 7.0
Higher Viability | ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models | 7.0

Conflicting

Competing Approach | On the Hidden Objective Biases of Group-based Reinforcement Learning | 2.0

Related Resources

Just-In-Time Reinforcement Learning (glossary)
Multi-Agent Reinforcement Learning (glossary)
Multi-Agent Test-Time Reinforcement Learning (MATTRL) (glossary)
How does PRISM improve reinforcement learning? (question)
What is the significance of reinforcement learning in AI? (question)
How does RetroAgent improve reinforcement learning? (question)
Reinforcement Learning – Use Cases (use_case)

Agent drawer

5 surfaces preserved for agents. Humans can ignore.

Developer contracts, payload previews, evidence maps, and run controls stay here instead of the Read, Build, and Track workspace.

Run context

Paper: 2601.05242
Route: /paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization
Active tab: read
Artifact: gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization

Available agents

Read extractor
Build planner
Track monitor
Competitive mapper
Related-paper scout

API/MCP endpoints

REST paper pack API: /api/v1/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization/paper-pack
REST build passport API: /api/v1/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization/build-passport
REST OpenAPI: /api/openapi.json
MCP descriptor: /api/mcp
MCP resource: sciencetostartup://surfaces/paper-workspace

Tool contracts

paper_pack, build_passport, opportunity_kernel, foresight, source_proof, evidence_state

Payload preview

{
  "contract_version": "paper-r2",
  "paper_id": "ddac1e01-a3de-4ed6-89de-e5c75f63c1b2",
  "arxiv_id": "2601.05242",
  "canonical_route": "/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization",
  "endpoints": {
    "paper_pack": "/api/v1/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization/paper-pack",
    "build_passport": "/api/v1/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}
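As a rough illustration of how an agent might consume this payload, the sketch below turns the relative REST paths into absolute URLs and leaves the custom-scheme MCP resource untouched. The base URL is an assumption taken from the curl example elsewhere on this page; the helper name resolve_endpoints and the trimmed payload dict are hypothetical, and no request is actually sent.

```python
from urllib.parse import urljoin

# Assumption: the relative paths are served from the same host as the curl example.
BASE_URL = "https://sciencetostartup.com"
SLUG = "gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization"

# Trimmed copy of the payload preview, keeping only the fields used here.
payload = {
    "contract_version": "paper-r2",
    "arxiv_id": "2601.05242",
    "endpoints": {
        "paper_pack": f"/api/v1/paper/{SLUG}/paper-pack",
        "build_passport": f"/api/v1/paper/{SLUG}/build-passport",
        "mcp_resource": "sciencetostartup://surfaces/paper-workspace",
    },
}

def resolve_endpoints(payload, base_url=BASE_URL):
    """Make relative REST paths absolute; pass other URIs through unchanged."""
    resolved = {}
    for name, target in payload["endpoints"].items():
        if target.startswith("/"):
            resolved[name] = urljoin(base_url, target)
        else:
            resolved[name] = target
    return resolved

urls = resolve_endpoints(payload)
for name, url in urls.items():
    print(f"{name}: {url}")
```

Keeping the resolution step separate from any HTTP client makes it easy to check the URLs before calling them.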

Schema validation

paper-r2 contract: present
JSON-LD twin: SSR emitted
OpenAPI path parity: /api/openapi.json
MCP resource parity: paper-workspace

Job trace

queued: drawer opened by user action
running: inspect or copy payload
succeeded: payload available in SSR
failed: route errors appear in evidence cards

Evidence map

sources used: page freshness, source proof anchors, JSON-LD
missing sources: exposed by PaperPack and EvidenceState chips
derived fallbacks: marked unverified before handoff

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.

Paper proof surface

Canonical route: /paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization

stale

Proof freshness: stale
Proof status: unverified
Display score: 4/10
Last proof check: 2026-04-02
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 17%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
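The freshness fields above are plain "Key: value" lines, so they are easy to read programmatically. The following is a minimal sketch, assuming the block is copied as text exactly as shown; the parse_receipt helper and the derived score_window_days value are hypothetical and not part of any documented contract.

```python
from datetime import date

# Text copied from the freshness block on this page.
RECEIPT = """\
Proof freshness: stale
Proof status: unverified
Display score: 4/10
Last proof check: 2026-04-02
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 17%
"""

def parse_receipt(text):
    """Split 'Key: value' lines into a dict, keeping every value as a string."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that do not contain the separator
            fields[key.strip()] = value.strip()
    return fields

fields = parse_receipt(RECEIPT)
updated = date.fromisoformat(fields["Score updated"])
fresh_until = date.fromisoformat(fields["Score fresh until"])
score_window_days = (fresh_until - updated).days

print(fields["Proof freshness"], fields["Coverage"], score_window_days)
```

With the dates shown, the score window works out to 30 days. Note that the score window and the proof freshness flag are separate fields: the proof is marked stale even though the score window is still open.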

OpenAlex: pending — this preprint is not yet indexed by OpenAlex.

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Canonical ID gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization | Route /paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2601.05242"
  }
}
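The MCP example above uses a compact tool/arguments shape. The Model Context Protocol itself frames tool invocations as JSON-RPC 2.0 requests with the method tools/call, so a client would typically wrap the example as sketched below. Whether this site's /api/mcp descriptor accepts exactly this envelope, and over which transport, is not stated on the page; the helper name to_jsonrpc_tool_call and the request id are hypothetical.

```python
import json

def to_jsonrpc_tool_call(example, request_id=1):
    """Wrap a {'tool', 'arguments'} example in a JSON-RPC 2.0 'tools/call' request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": example["tool"],
            "arguments": example["arguments"],
        },
    }

# The compact example shown on this page.
example = {"tool": "get_paper", "arguments": {"arxiv_id": "2601.05242"}}

request = to_jsonrpc_tool_call(example)
print(json.dumps(request, indent=2))
```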

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization",
  "normalized_query": "2601.05242",
  "route": "/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization",
  "paper_ref": "gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Not build-ready: Multi-Reward RL Optimization: GDPO for Language Models

/buildability/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization

Ignore | blocked

Subject: Multi-Reward RL Optimization: GDPO for Language Models

Verdict

Ignore

Verdict is Ignore because current viability and proof state do not clear the buildability gate.

Time to first demo

Insufficient data

No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.

Compute envelope

Structured compute envelope

Insufficient data

No data, compute, hardware, memory, latency, dependency, or serving requirement receipt is attached.

Evidence ids

Receipt path

/buildability/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization

Paper ref

gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization

arXiv id

2601.05242

Freshness

Generated at

2026-04-02T02:30:40.136Z

Evidence freshness

stale

Last verification

2026-04-02T02:30:40.136Z

Sources

References

Coverage

17%

Hash state

Lineage hash

332a61065d2f550a735d529a41ca33aa201d3d321921dba920b9556e6913fa85

Canonical opportunity-kernel lineage hash.

Signature state

External signature

unsigned_external

No founder, registry, pilot, or production-adoption signature is attached to this receipt.

Verification

not_verified

Verification is blocked until an external signature is provided.

Blockers

Missing: repo_url
Missing: references
Missing: proof_status
Missing: distribution_readiness_scores
Missing: paper_extraction_scorecards
Unknown: distribution readiness has not been computed yet
Unknown: proof verification has not been recorded yet

Verification pending / evidence receipt incomplete

repo_url

references

Missing proof, requirement, signature, approval, adoption, or telemetry fields are blockers and must not be inferred.

Source Proof anchors

Visual citations from the paper document graph.

20 anchors

proof block | Page 3 | 90%

This equation defines the score or evaluation function that determines model quality.

Page and bbox are available; crop image is pending.

proof block | Page 5 | 90%

This equation defines the score or evaluation function that determines model quality.

Page and bbox are available; crop image is pending.

proof block | Page 6 | 90%

This equation defines the score or evaluation function that determines model quality.

Page and bbox are available; crop image is pending.

JSON-LD twin

The application/ld+json payload rendered for agents.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization#webpage",
      "url": "https://sciencetostartup.com/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization",
      "name": "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization",
      "description": "Optimize multi-reward reinforcement learning with GDPO for stable and precise model training.",
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization#scholarlyArticle",
      "headline": "Multi-Reward RL Optimization: GDPO for Language Models",
      "description": "Optimize multi-reward reinforcement learning with GDPO for stable and precise model training.",
      "url": "https://sciencetostartup.com/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization",
      "sameAs": "https://arxiv.org/abs/2601.05242",
      "identifier": {
        "@type": "PropertyValue",
        "propertyID": "arXiv",
        "value": "2601.05242"
      },
      "isAccessibleForFree": true,
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      },
      "datePublished": "2026-01-08T18:59:24.000Z",
      "citation": [
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "1dc2281d4df5bd34f66aeb10e5b4741a27e23a9a"
          },
          "url": "https://www.semanticscholar.org/paper/1dc2281d4df5bd34f66aeb10e5b4741a27e23a9a"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "a7220bfa55807288b9af52ea11630cef3506225a"
          },
          "url": "https://www.semanticscholar.org/paper/a7220bfa55807288b9af52ea11630cef3506225a"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "78e03fb22a051c82bfa9e2051cd66245eba0f2dc"
          },
          "url": "https://www.semanticscholar.org/paper/78e03fb22a051c82bfa9e2051cd66245eba0f2dc"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "0ba8cbebdbb3a91f984313d33e76fe5946b96ad4"
          },
          "url": "https://www.semanticscholar.org/paper/0ba8cbebdbb3a91f984313d33e76fe5946b96ad4"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "89510d9c471bd77404ed0901b644bfd9871329f0"
          },
          "url": "https://www.semanticscholar.org/paper/89510d9c471bd77404ed0901b644bfd9871329f0"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "d2d84d56f730f81d276a02b48d5d44db5bde0b4a"
          },
          "url": "https://www.semanticscholar.org/paper/d2d84d56f730f81d276a02b48d5d44db5bde0b4a"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "5ac754e6d6b0ea662b19e83f445aec8f03569e75"
          },
          "url": "https://www.semanticscholar.org/paper/5ac754e6d6b0ea662b19e83f445aec8f03569e75"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e42054d042e2e5b3efe6e96cd6dd1c76ab3ca358"
          },
          "url": "https://www.semanticscholar.org/paper/e42054d042e2e5b3efe6e96cd6dd1c76ab3ca358"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "de23d38bc2604dcf334dcc46aff217eb6bcd1fe1"
          },
          "url": "https://www.semanticscholar.org/paper/de23d38bc2604dcf334dcc46aff217eb6bcd1fe1"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "dd4cfde3e135f799a9a71b4f57e13a29de89f7e3"
          },
          "url": "https://www.semanticscholar.org/paper/dd4cfde3e135f799a9a71b4f57e13a29de89f7e3"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "0681d80cd58b6135534cf279a69f5f999f47ebca"
          },
          "url": "https://www.semanticscholar.org/paper/0681d80cd58b6135534cf279a69f5f999f47ebca"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "03fa0740512a47758940415a6b3c1a635d9aca98"
          },
          "url": "https://www.semanticscholar.org/paper/03fa0740512a47758940415a6b3c1a635d9aca98"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e23379a9752f57732d311f7a97af2c69af6fae7b"
          },
          "url": "https://www.semanticscholar.org/paper/e23379a9752f57732d311f7a97af2c69af6fae7b"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "668075792a7ab40457d92e09da28d35c879271c3"
          },
          "url": "https://www.semanticscholar.org/paper/668075792a7ab40457d92e09da28d35c879271c3"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "09cedd18546c334891df025aa3a4e21658affa23"
          },
          "url": "https://www.semanticscholar.org/paper/09cedd18546c334891df025aa3a4e21658affa23"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "131a13c60f179511572abc81d6bd6aa988e96854"
          },
          "url": "https://www.semanticscholar.org/paper/131a13c60f179511572abc81d6bd6aa988e96854"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "cd1152ba356ba13ca700131ab55296fcaf9baa3a"
          },
          "url": "https://www.semanticscholar.org/paper/cd1152ba356ba13ca700131ab55296fcaf9baa3a"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f2d0f3d47ae850f49a58f4977393bd0025af4bec"
          },
          "url": "https://www.semanticscholar.org/paper/f2d0f3d47ae850f49a58f4977393bd0025af4bec"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "a75da880b921d81426800a9893ce7c743339b278"
          },
          "url": "https://www.semanticscholar.org/paper/a75da880b921d81426800a9893ce7c743339b278"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "0350636522997217df53553ddf3e472338bca97b"
          },
          "url": "https://www.semanticscholar.org/paper/0350636522997217df53553ddf3e472338bca97b"
        }
      ],
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "propertyID": "viabilityScore",
          "value": 4
        },
        {
          "@type": "PropertyValue",
          "propertyID": "researchDomain",
          "value": "Reinforcement Learning"
        }
      ],
      "keywords": [
        "multi-reward reinforcement learning optimization",
        "language model alignment with human preferences",
        "group reward decoupled normalization policy optimization",
        "improving language model training stability",
        "tool calling math reasoning coding reasoning RL"
      ]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://sciencetostartup.com"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Reinforcement Learning",
          "item": "https://sciencetostartup.com/topics"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "GDPO: Group reward-Decoupled Normalization Policy Optimizati",
          "item": "https://sciencetostartup.com/paper/gdpo-group-reward-decoupled-normalization-policy-optimization-for-multi-reward-rl-optimization"
        }
      ]
    }
  ]
}
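For agents that read the JSON-LD twin rather than the visible page, the sketch below shows one way to pull the arXiv identifier and the additionalProperty values out of the @graph array. It runs against a trimmed copy of the payload above (citations and breadcrumbs omitted); the find_node helper is hypothetical.

```python
import json

# Trimmed copy of the JSON-LD payload; only the fields read below are kept.
JSON_LD = json.loads("""
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebPage",
     "name": "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization"},
    {"@type": "ScholarlyArticle",
     "sameAs": "https://arxiv.org/abs/2601.05242",
     "identifier": {"@type": "PropertyValue", "propertyID": "arXiv", "value": "2601.05242"},
     "datePublished": "2026-01-08T18:59:24.000Z",
     "additionalProperty": [
       {"@type": "PropertyValue", "propertyID": "viabilityScore", "value": 4},
       {"@type": "PropertyValue", "propertyID": "researchDomain", "value": "Reinforcement Learning"}
     ]}
  ]
}
""")

def find_node(doc, node_type):
    """Return the first @graph node whose @type matches, or None."""
    for node in doc.get("@graph", []):
        if node.get("@type") == node_type:
            return node
    return None

article = find_node(JSON_LD, "ScholarlyArticle")
# Flatten the PropertyValue list into a plain dict keyed by propertyID.
props = {p["propertyID"]: p["value"] for p in article["additionalProperty"]}

print(article["identifier"]["value"], props["viabilityScore"], props["researchDomain"])
```

Looking nodes up by @type rather than by list position keeps the lookup stable if the order of the graph entries changes.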
