ARXIV:2605.12969 · REINFORCEMENT LEARNING · SUBMITTED 14 MAY · 20:10 UTC · FRESHNESS FRESH

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

Feng Zhang · Xinhong Ma · Ziqiang Dong · Xi Leng · Jianfei Zhao · Xin Sun · Yang Yang · Guanjun Jiang · arXiv

A contrastive sequence-level policy optimization framework for RLVR that improves LLM reasoning by aligning rollout scores with generation likelihoods.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A contrastive sequence-level policy optimization framework for RLVR that improves LLM reasoning by aligning rollout scores with generation likelihoods.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

A contrastive sequence-level policy optimization framework for RLVR that improves LLM reasoning by aligning rollout scores with generation likelihoods. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted…

METHOD

Full abstract

RLVR has become a widely adopted paradigm for improving LLMs' reasoning capabilities, and GRPO is one of its most representative algorithms. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Under this view, GRPO increases sequence-level scores of verified positive rollouts and decreases those of negative rollouts, where the scores are averages of clipped token-level importance sampling ratios. This reformulation reveals two structural limitations of GRPO: likelihood-misaligned scoring, where clipped ratio-based surrogate scores are optimized instead of generation likelihoods, and score-insensitive credit assignment, where rollout-level credit is assigned without accounting for relative score gaps between positive and negative rollouts in the same group. To address these limitations, we propose ConSPO, a framework for Contrastive Sequence-level Policy Optimization in RLVR. ConSPO replaces GRPO's clipped ratio-based scores with length-normalized sequence log-probabilities, aligning the optimized rollout scores with the likelihoods used in autoregressive generation. It then optimizes a group-wise InfoNCE-style objective that contrasts each positive rollout against negative distractors from the same group, enabling credit assignment to depend on their relative scores. This contrastive formulation amplifies updates for poorly separated positives while concentrating suppressive updates on high-scoring negatives. Moreover, ConSPO introduces a curriculum-scheduled margin, guiding optimization from coarse positive-negative ordering in early training toward stronger separation in later stages. Extensive evaluations across diverse backbone models, parameter scales, and training datasets show that ConSPO consistently outperforms several strong RLVR baselines on challenging mathematical reasoning benchmarks.
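The two ingredients the abstract attributes to ConSPO, length-normalized sequence log-probabilities as rollout scores and a group-wise InfoNCE-style contrast between positives and negatives, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, the way the margin is subtracted from the positive logit, and the linear margin schedule are all hypothetical choices, since the abstract does not specify them.

```python
import math


def sequence_score(token_logprobs):
    """Length-normalized sequence log-probability: mean of per-token log-probs."""
    return sum(token_logprobs) / len(token_logprobs)


def margin_at(step, total_steps, m_start=0.0, m_end=0.5):
    """Hypothetical linear curriculum: the margin grows from m_start to m_end.

    The abstract only says the margin is curriculum-scheduled; the linear
    shape and the default endpoints are assumptions for illustration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return m_start + frac * (m_end - m_start)


def _softmax(logits):
    """Numerically stable softmax over a list of floats."""
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def group_infonce_loss(scores, rewards, margin=0.0, tau=1.0):
    """InfoNCE-style loss over one rollout group (assumed form).

    Each verified-positive rollout (reward > 0) is contrasted against all
    negative rollouts in the same group:
        loss_i = -log softmax([(s_i - margin) / tau, s_neg_1 / tau, ...])[0]
    Returns the mean over positives, or 0.0 if the group has no contrast.
    """
    pos = [s for s, r in zip(scores, rewards) if r > 0]
    neg = [s for s, r in zip(scores, rewards) if r <= 0]
    if not pos or not neg:
        return 0.0
    losses = []
    for s_p in pos:
        logits = [(s_p - margin) / tau] + [s_n / tau for s_n in neg]
        losses.append(-math.log(_softmax(logits)[0]))
    return sum(losses) / len(losses)


def update_weights(pos_score, neg_scores, margin=0.0, tau=1.0):
    """Gradient magnitudes of the loss above, up to a 1/tau factor.

    Returns (pull, pushes): `pull` is the weight raising the positive's
    score, `pushes[j]` is the weight lowering negative j's score.
    """
    logits = [(pos_score - margin) / tau] + [s / tau for s in neg_scores]
    probs = _softmax(logits)
    return 1.0 - probs[0], probs[1:]
```

Under these assumptions, `update_weights` makes the abstract's credit-assignment claim concrete: the pull on a positive grows as it becomes harder to separate from the negatives, and the push on each negative is its softmax share, so higher-scoring negatives absorb most of the suppressive update. Raising the margin over training (for example via `margin_at(step, total_steps)`) increases the loss for the same scores, which demands stronger separation in later stages.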

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Code availability is flagged in…

WHY NOW

Reinforcement Learning moved forward this cycle; last verified May 2026. Public score 7.0/10. Production flags indicate code availability.

Competitive landscape

Segment: Reinforcement Learning
Adoption evidence: No public code link in the paper record yet
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified
