ARXIV:2603.23926 · REINFORCEMENT LEARNING THEORY · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified Source: PDF linked | Verified PaperPack: citation fields available | Partial Proof: unverified proof status

Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Guy Zamir · Matthew Zurek · Yudong Chen · arXiv

Develops theoretical optimal regret bounds for infinite-horizon reinforcement learning problems.

Blocked on Code › Score 2.0 | Evidence unverified

Opportunity summary

Pain Develops theoretical optimal regret bounds for infinite-horizon reinforcement learning problems.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

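Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) is less developed, both theoretically and algorithmically, than its episodic counterpart: many algorithms suffer from high "burn-in" costs and fail to adapt to benign instance-specific complexity. The paper addresses these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret.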

METHOD

Full abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in'' costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
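For quick reference, the bounds stated in the abstract can be collected in one display. This only restates the abstract: constants and logarithmic factors are suppressed, the row labels are ours rather than the paper's, and the lower-order rows apply to the average-reward setting.

```latex
\begin{align*}
\text{Regret (average-reward and $\gamma$-regret)}
  &= \tilde{O}\Big(\sqrt{SA\,\text{Var}} + \text{lower-order terms}\Big) \\[4pt]
\text{Lower-order terms, $\Vert h^\star\Vert_\text{sp}$ known (algorithm)}
  &\sim \Vert h^\star\Vert_\text{sp}\, S^2 A \\
\text{Lower-order terms, no prior knowledge (lower bound)}
  &\gtrsim \Vert h^\star\Vert_\text{sp}^2\, S A \\
\text{Lower-order terms, no prior knowledge (prior-free algorithm)}
  &\sim \Vert h^\star\Vert_\text{sp}^2\, S^3 A
\end{align*}
```

Read this way, the prior-free upper bound and the lower bound agree in their dependence on $\Vert h^\star\Vert_\text{sp}$ and $A$ and differ only in the power of $S$.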

RESULT

The paper develops a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. ScienceToStartup currently rates this 2.0/10 on the public viability pass.
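To make "variance-dependent" concrete, here is a minimal sketch of why the leading term can vanish in a deterministic MDP. It is not the paper's algorithm. It assumes, as an illustration only, that $\text{Var}$ is the sum over visited state-action pairs of the variance of a value-like function `h` at the next state; the abstract says only that $\text{Var}$ "captures cumulative transition variance". The function names, the toy MDPs, and `h` are all hypothetical.

```python
import numpy as np

def cumulative_transition_variance(P, h, trajectory):
    """Sum over visited (s, a) of Var_{s' ~ P[s, a]}[h(s')].

    Assumed stand-in for the paper's Var term, for illustration only.
    P has shape (S, A, S); h has shape (S,).
    """
    total = 0.0
    for s, a in trajectory:
        p = P[s, a]                     # next-state distribution, shape (S,)
        mean = p @ h                    # expected value of h at the next state
        total += p @ (h - mean) ** 2    # variance of h at the next state
    return float(total)

def leading_term(S, A, var):
    """sqrt(S * A * Var), ignoring constants and log factors."""
    return float(np.sqrt(S * A * var))

# Toy example: two states, one action, h = (0, 1).
h = np.array([0.0, 1.0])
P_det = np.array([[[0.0, 1.0]], [[1.0, 0.0]]])    # deterministic: states alternate
P_rand = np.array([[[0.5, 0.5]], [[0.5, 0.5]]])   # uniform next state
trajectory = [(0, 0), (1, 0), (0, 0), (1, 0)]     # illustrative visit sequence

var_det = cumulative_transition_variance(P_det, h, trajectory)
var_rand = cumulative_transition_variance(P_rand, h, trajectory)

print(var_det, leading_term(2, 1, var_det))
print(var_rand, leading_term(2, 1, var_rand))
```

Under this assumed definition the deterministic chain contributes zero variance at every step, so only the lower-order terms remain. This matches the abstract's remark about nearly constant regret in deterministic MDPs.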

WHY NOW

Reinforcement Learning Theory moved forward this cycle; last verified April 2026. Public score 2.0/10.

Competitive landscape

Develops theoretical optimal regret bounds for infinite-horizon reinforcement learning problems.

Segment: Reinforcement Learning Theory

Adoption evidence: No public code link in the paper record yet

Commercial read: 2.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper record: arXiv:2603.23926 (https://arxiv.org/abs/2603.23926), published 2026-03-25.
