ARXIV:2604.27865 · SEQUENTIAL DECISION MAKING · SUBMITTED 01 MAY · 15:04 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Thomas Grady · Kip Parker · Iliyan Zarov · Henry Course · Chengxi Taylor · Ross Taylor · arXiv

Introducing KellyBench: an Open-API for testing AI models in sequential sports betting scenarios.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain Introducing KellyBench: an Open-API for testing AI models in sequential sports betting scenarios.

Evidence 0 refs | 4 sources | 67% coverage

Blocker Evidence unverified

PROBLEM

Language models are saturating benchmarks for procedural tasks with narrow objectives, but they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals.

METHOD

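KellyBench is an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds.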

Full abstract

Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season across five seeds. The best performing model achieves an average return of -8%, and many models experience ruin across seeds. To judge strategy sophistication, we use a human expert rubric to grade each model and find their approaches to be unsophisticated compared to human baselines; Claude Opus 4.6 achieves a rubric score of 26.5%, indicating significant room for improvement. KellyBench is available as an open-access API endpoint at https://openreward.ai/GeneralReasoning/KellyBench.
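As a rough illustration of what "identifying edge in public markets" and sizing bets for long-term bankroll growth can look like, the sketch below converts decimal odds into margin-free implied probabilities and computes a Kelly stake fraction. This is the textbook Kelly criterion that the benchmark's name alludes to, not the paper's agent code or the KellyBench API; the function names, the odds, and the model probability are all hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Turn decimal odds into probabilities that sum to 1.

    Assumption: the bookmaker margin is removed proportionally, which is
    one common convention and not necessarily what KellyBench uses.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]


def kelly_fraction(p, decimal_odds):
    """Kelly stake as a fraction of bankroll for a single win/lose bet.

    p is the bettor's own win probability; returns 0 when there is no edge.
    """
    b = decimal_odds - 1.0          # net payout per unit staked
    f = (p * b - (1.0 - p)) / b     # classic Kelly formula f* = (pb - q) / b
    return max(0.0, f)


# Hypothetical home/draw/away odds and a hypothetical model estimate.
odds = [2.10, 3.40, 3.60]
market = implied_probabilities(odds)
model_p_home = 0.50

# A positive stake here means the model sees edge over the market price.
stake = kelly_fraction(model_p_home, odds[0])
print(f"market p(home)={market[0]:.3f}, model p(home)={model_p_home:.3f}, stake={stake:.3f}")
```

In practice bettors often stake only a fraction of the Kelly amount, since the full stake is sensitive to errors in the estimated probability.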

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. The best performing model achieves an average return of -8%, and many models experience ruin across seeds. A public repository is linked, so build…
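To make the reported quantities concrete, here is a minimal sketch of how a per-seed return and a ruin flag could be computed from a sequence of settled bets, and then averaged across seeds. The bet format, the ruin threshold, and the two toy "seeds" are assumptions for illustration only; they are not data from the paper, and the paper's exact definition of ruin may differ.

```python
def simulate_bankroll(bets, start=1.0, ruin_level=0.0):
    """Apply bets in order and return (total_return, ruined).

    Each bet is (stake_fraction, decimal_odds, won). Assumption: a run is
    ruined once the bankroll falls to ruin_level or below.
    """
    bankroll = start
    ruined = False
    for fraction, odds, won in bets:
        stake = bankroll * fraction
        # A win pays stake * (odds - 1); a loss forfeits the stake.
        bankroll += stake * (odds - 1.0) if won else -stake
        if bankroll <= ruin_level:
            ruined = True
            break
    return bankroll / start - 1.0, ruined


# Two toy seeds with made-up bets (not results from the paper).
seeds = [
    [(0.1, 2.0, True), (0.1, 2.0, False)],  # small stakes, one win, one loss
    [(1.0, 2.0, False)],                    # all-in and lost: ruin
]

results = [simulate_bankroll(bets) for bets in seeds]
mean_return = sum(r for r, _ in results) / len(results)
ruin_count = sum(1 for _, ruined in results if ruined)
print(f"mean return={mean_return:.3f}, ruined seeds={ruin_count}/{len(results)}")
```

Averaging over seeds in this way shows why a single ruined run can dominate the mean: a return of -100% on one seed outweighs small gains or losses on the others.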

WHY NOW

Sequential Decision Making moved forward this cycle; last verified May 2026. Public score 8.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score 8.0

Pain Introducing KellyBench: an Open-API for testing AI models in sequential sports betting scenarios.

Evidence 0 refs | 4 sources | 67% coverage

Blocker No shell-level blocker reported

Analysis summary

Introducing KellyBench: an Open-API for testing AI models in sequential sports betting scenarios.

Competitive landscape

Introducing KellyBench: an Open-API for testing AI models in sequential sports betting scenarios.

Segment: Sequential Decision Making

Adoption evidence: Public code linked for build inspection

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-04-30 13:47 UTC · arXiv: https://arxiv.org/abs/2604.27865 · Free to access

Code repository: https://github.com/GeneralReasoning/firehorse

Commercial readiness: code, repo url

FAQ

What is the startup potential of "KellyBench: A Benchmark for Long-Horizon Sequential Decision"?

Introducing KellyBench: an Open-API for testing AI models in sequential sports betting scenarios.

What products could be built from this research?

KellyBench can be commercialized as a subscription-based tool for betting analysts, researchers, and AI developers to test and improve their models in a controlled, realistic sports betting environment.

What are the practical use cases?

Commercial betting platforms could integrate KellyBench to offer enhanced predictive analytics tools to bettors, or as a training ground for developing AI-driven trading bots for sports markets.

What industries could this research disrupt?

KellyBench could disrupt traditional sports betting analytics by providing a more dynamic and comprehensive environment for model testing, potentially replacing some aspects of human intuition-driven betting strategies with data-driven insights.
