ARXIV:2605.13554 · REINFORCEMENT LEARNING · SUBMITTED 14 MAY · 20:10 UTC · FRESHNESS FRESH

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Asim Osman · Sasha Abramowitz · Mark Bergh · Ulrich Armel Mbou Sob · Ruan John de Kock · Omayma Mahjoub · +10 at arXiv

A new on-policy contrastive RL algorithm that matches or exceeds PPO's performance without handcrafted rewards, applicable to discrete and continuous tasks.

Blocked on Code › Score 4.0 · Evidence unverified

Opportunity summary

Pain A new on-policy contrastive RL algorithm that matches or exceeds PPO's performance without handcrafted rewards, applicable to discrete and continuous tasks.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

A new on-policy contrastive RL algorithm that matches or exceeds PPO's performance without handcrafted rewards, applicable to discrete and continuous tasks. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL…

METHOD

Full abstract

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.
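The abstract describes three pieces: a contrastive critic over state-action and goal representations, advantages derived from its Q-values, and the standard PPO objective. The sketch below is one possible reading of that pipeline for discrete actions, not the authors' implementation. The inner-product critic, the InfoNCE loss, the policy-weighted baseline used for the advantage, and all function names are assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np

def contrastive_q(sa_emb, goal_emb):
    """Hypothetical critic: entry [i, j] scores state-action i against goal j
    as an inner product of their embeddings (an assumed parameterisation)."""
    return sa_emb @ goal_emb.T

def infonce_loss(logits):
    """InfoNCE over a batch: goal i is the positive for state-action i,
    and every other goal in the batch serves as a negative."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def advantages_from_q(q_all_actions, actions, policy_probs):
    """Assumed estimator: A(s, a, g) = Q(s, a, g) - sum_a' pi(a'|s, g) Q(s, a', g).
    No reward term appears anywhere; the signal comes only from the critic."""
    q_taken = q_all_actions[np.arange(len(actions)), actions]
    baseline = (policy_probs * q_all_actions).sum(axis=1)
    return q_taken - baseline

def ppo_clip_loss(new_logp, old_logp, adv, clip_eps=0.2):
    """Standard PPO clipped surrogate, negated so that lower is better."""
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim, n_actions = 8, 16, 4

    # Critic update on a fresh on-policy batch (no replay buffer).
    sa_emb = rng.normal(size=(batch, dim))
    goal_emb = rng.normal(size=(batch, dim))
    print("critic loss:", infonce_loss(contrastive_q(sa_emb, goal_emb)))

    # Policy update: Q-values per action for each (state, goal) pair.
    q_all = rng.normal(size=(batch, n_actions))
    probs = np.full((batch, n_actions), 1.0 / n_actions)
    actions = rng.integers(0, n_actions, size=batch)
    adv = advantages_from_q(q_all, actions, probs)
    old_logp = np.log(probs[np.arange(batch), actions])
    print("policy loss:", ppo_clip_loss(old_logp, old_logp, adv))
```

When the new and old log-probabilities are identical, the ratio is 1 and the policy loss reduces to the negative mean advantage, which gives a quick sanity check before plugging in a real policy network.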

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14…

WHY NOW

Reinforcement Learning moved forward this cycle; last verified May 2026. Public score 4.0/10.

Opportunity summary

Score 4.0

Pain A new on-policy contrastive RL algorithm that matches or exceeds PPO's performance without handcrafted rewards, applicable to discrete and continuous tasks.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported

Analysis summary

A new on-policy contrastive RL algorithm that matches or exceeds PPO's performance without handcrafted rewards, applicable to discrete and continuous tasks.

Competitive landscape

A new on-policy contrastive RL algorithm that matches or exceeds PPO's performance without handcrafted rewards, applicable to discrete and continuous tasks.

Segment

Reinforcement Learning

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Authors

Asim Osman · Sasha Abramowitz · Mark Bergh · Ulrich Armel Mbou Sob · Ruan John de Kock · Omayma Mahjoub · Oussama Hidaoui · Noah De Nicola · Arnol Manuel Fokam · Felix Chalumeau · Daniel Rajaonarivonivelomanantsoa · Siddarth Singh · Refiloe Shabe · Juan Claude Formanek · Simon Verster Du Toit · Arnu Pretorius

Published

2026-05-13T13:58:49.000Z

arXiv

https://arxiv.org/abs/2605.13554
