ARXIV:2604.17654 · LLM REASONING · SUBMITTED 21 APR · 20:32 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Poly-EPO: Training Exploratory Reasoning Models

Ifdita Hasan Orney · Jubayer Ibn Hamid · Shreya S Ramanujam · Shirley Wu · Hengyuan Hu · Noah Goodman · Dorsa Sadigh · Chelsea Finn

A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.

Evidence 0 refs | 4 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.

METHOD

Full abstract

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
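To make the set-level idea in the abstract concrete, here is a minimal Python sketch of an advantage computed over sets of responses instead of single responses. It is an illustration under stated assumptions, not the paper's objective or algorithm: the `set_score` function (mean reward times the fraction of distinct strategy labels), the `strategies` labels, and the partition of samples into consecutive fixed-size sets are all hypothetical choices introduced here.

```python
from statistics import mean


def set_score(rewards, strategies):
    """Hypothetical set objective: mean reward scaled by strategy diversity.

    Diversity is the fraction of distinct strategy labels in the set.
    Both the labels and this product form are assumptions for illustration.
    """
    diversity = len(set(strategies)) / len(strategies)
    return mean(rewards) * diversity


def set_advantages(rewards, strategies, set_size):
    """Score consecutive sets of responses and share each set's advantage.

    The baseline is the mean score over all sets for the same prompt, so
    the advantages of the sets sum to zero.
    """
    n = len(rewards)
    if n != len(strategies):
        raise ValueError("rewards and strategies must have the same length")
    if n == 0 or n % set_size != 0:
        raise ValueError("number of responses must be a positive multiple of set_size")

    scores = [
        set_score(rewards[start:start + set_size], strategies[start:start + set_size])
        for start in range(0, n, set_size)
    ]
    baseline = mean(scores)
    # Every response inherits the advantage of the set it belongs to.
    return [scores[i // set_size] - baseline for i in range(n)]


# Six sampled responses for one prompt, grouped into three sets of two.
rewards = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
strategies = ["induction", "induction", "induction", "casework", "algebra", "casework"]
adv = set_advantages(rewards, strategies, set_size=2)
print(adv)
```

In this toy run the set that is both correct and varied in strategy receives a positive advantage, the correct but repetitive set lands at the baseline, and the incorrect set is pushed down. In a policy-gradient update these values would take the place of the usual per-response advantages.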

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass.

WHY NOW

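LLM Reasoning moved forward this cycle; last verified April 2026. Public score: 7.0/10. A repository is linked as implementation evidence (https://github.com/goodfeli/dlbook_notation), but that repository holds LaTeX notation files for the Deep Learning textbook rather than Poly-EPO code, so the implementation evidence remains unverified.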

Opportunity summary

Score 7.0

Pain A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.

Evidence 0 refs | 4 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment: LLM Reasoning
Adoption evidence: Public code linked for build inspection
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/poly-epo-training-exploratory-reasoning-models#webpage",
      "url": "https://sciencetostartup.com/paper/poly-epo-training-exploratory-reasoning-models",
      "name": "Poly-EPO: Training Exploratory Reasoning Models",
      "description": "A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.",
      "isPartOf": { "@id": "https://sciencetostartup.com/#website" }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/poly-epo-training-exploratory-reasoning-models#scholarlyArticle",
      "headline": "Poly-EPO: Training Exploratory Reasoning Models",
      "description": "A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.",
      "url": "https://sciencetostartup.com/paper/poly-epo-training-exploratory-reasoning-models",
      "sameAs": "https://arxiv.org/abs/2604.17654",
      "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.17654" },
      "isAccessibleForFree": true,
      "isPartOf": { "@id": "https://sciencetostartup.com/#website" },
      "datePublished": "2026-04-19T22:54:19.000Z",
      "author": [
        { "@type": "Person", "name": "Ifdita Hasan Orney" },
        { "@type": "Person", "name": "Jubayer Ibn Hamid" },
        { "@type": "Person", "name": "Shreya S Ramanujam" },
        { "@type": "Person", "name": "Shirley Wu" },
        { "@type": "Person", "name": "Hengyuan Hu" },
        { "@type": "Person", "name": "Noah Goodman" },
        { "@type": "Person", "name": "Dorsa Sadigh" },
        { "@type": "Person", "name": "Chelsea Finn" }
      ],
      "codeRepository": "https://github.com/goodfeli/dlbook_notation",
      "additionalProperty": [
        { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 },
        { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "LLM Reasoning" },
        { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code, repo url" }
      ]
    },
    {
      "@type": "SoftwareSourceCode",
      "@id": "https://sciencetostartup.com/paper/poly-epo-training-exploratory-reasoning-models#software",
      "name": "Poly-EPO: Training Exploratory Reasoning Models - Source Code",
      "description": "A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.",
      "codeRepository": "https://github.com/goodfeli/dlbook_notation",
      "url": "https://github.com/goodfeli/dlbook_notation"
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" },
        { "@type": "ListItem", "position": 2, "name": "LLM Reasoning", "item": "https://sciencetostartup.com/topics" },
        { "@type": "ListItem", "position": 3, "name": "Poly-EPO: Training Exploratory Reasoning Models", "item": "https://sciencetostartup.com/paper/poly-epo-training-exploratory-reasoning-models" }
      ]
    }
  ]
}
