ARXIV:2603.21877 · LLM REASONING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

P^2O: Joint Policy and Prompt Optimization

Xinyu Lu · Kaiqi Zhang · Jinglin Yang · Boxi Cao · Yaojie Lu · Hongyu Lin · +3 at arXiv

Optimize LLM reasoning by jointly evolving prompts and policies to efficiently learn from challenging examples.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain Optimize LLM reasoning by jointly evolving prompts and policies to efficiently learn from challenging examples.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

Vanilla RLVR (Reinforcement Learning with Verifiable Rewards) suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. P^2O targets this by jointly evolving prompts and policies so the model can learn efficiently from these challenging examples.
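As a rough illustration of what "hard samples" could mean in practice, the sketch below flags samples whose empirical success rate over a handful of rollouts is at or below a cutoff. The function names, the sample ids, and the `threshold` parameter are all hypothetical; the page does not state how P^2O actually selects hard samples.

```python
from typing import Dict, List


def success_rates(rewards_by_sample: Dict[str, List[float]]) -> Dict[str, float]:
    """Empirical success rate per sample: mean of binary verifier rewards."""
    return {sid: sum(r) / len(r) for sid, r in rewards_by_sample.items() if r}


def find_hard_samples(
    rewards_by_sample: Dict[str, List[float]], threshold: float = 0.0
) -> List[str]:
    """Return ids whose success rate is at or below `threshold`.

    `threshold` is an assumed knob for illustration, not a value from the paper.
    """
    rates = success_rates(rewards_by_sample)
    return sorted(sid for sid, rate in rates.items() if rate <= threshold)


# Toy rollout outcomes (1.0 = verifier accepted the answer, 0.0 = rejected).
rollout_rewards = {
    "q1": [1.0, 0.0, 1.0, 1.0],
    "q2": [0.0, 0.0, 0.0, 0.0],
    "q3": [0.0, 1.0, 0.0, 0.0],
}

hard = find_hard_samples(rollout_rewards)
print(hard)
```

With the default cutoff only samples that never succeed are flagged; raising the cutoff widens the set to samples that succeed rarely.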

METHOD

Full abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
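The zero-advantage problem described in the abstract can be seen in a few lines. The sketch below assumes a group-normalized advantage of the form (r - mean) / (std + eps), as used by GRPO-style methods; the abstract does not say which estimator P^2O uses, so treat the formula and the name `group_advantages` as illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed estimator for illustration; not confirmed as the paper's choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A hard sample: every rollout fails, so every advantage is exactly zero
# and the policy gradient for this sample vanishes.
all_fail = [0.0, 0.0, 0.0, 0.0]
adv_plain = group_advantages(all_fail)

# If one rollout succeeds (for example, one guided by a better prompt),
# the group now carries a non-zero learning signal.
with_guided_success = [0.0, 0.0, 0.0, 1.0]
adv_guided = group_advantages(with_guided_success)

print(adv_plain)
print(adv_guided)
```

Under this assumption, a single successful trajectory in the group is enough to turn a sample with no gradient into one with a positive signal for the success and a negative signal for the failures, which is consistent with the abstract's claim of "denser positive supervision signals for hard samples".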

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Per the abstract, P^2O achieves superior performance on in-distribution datasets and improves out-of-distribution benchmarks by +4.7% on average.

WHY NOW

LLM Reasoning moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain Optimize LLM reasoning by jointly evolving prompts and policies to efficiently learn from challenging examples.

Evidence 0 refs | 0 sources | 17% coverage

Blocker no shell-level blocker reported

Competitive landscape

Optimize LLM reasoning by jointly evolving prompts and policies to efficiently learn from challenging examples.

Segment

LLM Reasoning

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Title: P^2O: Joint Policy and Prompt Optimization
Authors: Xinyu Lu · Kaiqi Zhang · Jinglin Yang · Boxi Cao · Yaojie Lu · Hongyu Lin · Min He · Xianpei Han · Le Sun
Published: 2026-03-23T12:08:47.000Z
arXiv: https://arxiv.org/abs/2603.21877
Page: https://sciencetostartup.com/paper/p-2o-joint-policy-and-prompt-optimization
Research domain: LLM Reasoning · Viability score: 7 · Commercial readiness: code
