ARXIV:2604.03098 · AGENTS · SUBMITTED 06 APR · 20:12 UTC · FRESHNESS UNKNOWN

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Co-Evolution of Policy and Internal Reward for Language Agents

Xinyu Wang · Hanwei Wu · Jingwei Song · Shuyuan Zhang · Jiayi Zhang · Fanqi Kong · +5 at arXiv

Develops a novel self-generated internal reward system for LLM agents to improve long-horizon decision-making and policy optimization.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: Develops a novel self-generated internal reward system for LLM agents to improve long-horizon decision-making and policy optimization.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified


PROBLEM

Develops a novel self-generated internal reward system for LLM agents to improve long-horizon decision-making and policy optimization. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited…

METHOD

Full abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
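The training side of the loop described in the abstract can be illustrated with a toy calculation. This page does not say how Self-Guide signals are turned into numbers or how they are combined with the environment reward, so the following is only a minimal sketch under assumptions: `trajectory_return`, its `weight` parameter, the additive mix, and the toy rollouts are all hypothetical. Only the group-relative standardization follows the way GRPO is commonly described, and implementations differ on details such as the standard deviation estimator.

```python
from statistics import mean, pstdev


def trajectory_return(env_reward, step_rewards, weight=0.5):
    """Hypothetical mix (an assumption, not the paper's formula):
    sparse terminal reward plus a weighted mean of step-level internal rewards."""
    dense = mean(step_rewards) if step_rewards else 0.0
    return env_reward + weight * dense


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style normalization: standardize each return against the
    mean and (population) standard deviation of its own rollout group."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Toy group of four rollouts for the same task:
# (environment reward at the end, self-generated step-level rewards).
group = [
    (1.0, [1.0, 1.0, 0.0]),    # solved the task
    (0.0, [1.0, 0.0, 0.0]),    # failed, but one step was judged helpful
    (0.0, [0.0, 0.0, 0.0]),    # failed, neutral steps
    (0.0, [-1.0, 0.0, -1.0]),  # failed, steps judged harmful
]

returns = [trajectory_return(env, steps) for env, steps in group]
advantages = group_relative_advantages(returns)

# Baseline for comparison: environment reward only.
env_only = group_relative_advantages([env for env, _ in group])
```

With environment reward alone, the three failed rollouts receive identical advantages. In this toy group the step-level signal is what separates them, which is the sense in which the abstract calls the optimization "denser".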

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Code availability is flagged in the…

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score: 7.0

Pain: Develops a novel self-generated internal reward system for LLM agents to improve long-horizon decision-making and policy optimization.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: no shell-level blocker reported


Competitive landscape


Segment: Agents

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Paper record

Title: Co-Evolution of Policy and Internal Reward for Language Agents

arXiv: 2604.03098 (https://arxiv.org/abs/2604.03098)

Page: https://sciencetostartup.com/paper/co-evolution-of-policy-and-internal-reward-for-language-agents

Published: 2026-04-03T15:21:11.000Z

Authors: Xinyu Wang · Hanwei Wu · Jingwei Song · Shuyuan Zhang · Jiayi Zhang · Fanqi Kong · Tung Sum Thomas Kwok · Xiao-Wen Chang · Yuyu Luo · Chenglin Wu · Bang Liu

Research domain: Agents

Viability score: 7

Commercial readiness: code

Free to access: yes

