GTO Wizard Benchmark

PROBLEM

A standardized benchmark and API to evaluate AI agents in complex poker games, revealing significant gaps in current LLM reasoning capabilities. The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent…

METHOD

Full abstract

We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold'em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash Equilibria, and defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by $19.4 \pm 4.1$ bb/100. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others. Initial results and analysis reveal dramatic progress in LLM reasoning over recent years, yet all models remain far below the baseline established by our benchmark. Qualitative analysis reveals clear opportunities for improvement, including representation and the ability to reason over hidden states. This benchmark provides researchers with a precise and quantifiable setting to evaluate advances in planning and reasoning in multi-agent systems with partial observability.
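The "ten times fewer hands" claim can be read as sample-size arithmetic: for a fixed confidence-interval width, the number of hands needed scales with the per-hand variance, so a 10x variance reduction means roughly 10x fewer hands. The sketch below illustrates only that arithmetic. It is not AIVAT itself, and the function names, the 10 bb per-hand standard deviation, and the use of a 95% normal interval (z = 1.96) are assumptions made for illustration; the abstract does not state which interval the reported $\pm 4.1$ bb/100 uses.

```python
import math
import statistics


def winrate_bb100(results_bb, z=1.96):
    """Return (win rate, CI half-width), both in bb/100.

    results_bb holds one entry per hand: big blinds won (+) or lost (-).
    z=1.96 assumes a 95% normal-approximation interval (an assumption).
    """
    n = len(results_bb)
    mean = statistics.fmean(results_bb)
    sd = statistics.stdev(results_bb)  # sample standard deviation per hand
    half_width = z * sd / math.sqrt(n)
    return 100 * mean, 100 * half_width


def hands_needed(sd_bb_per_hand, half_width_bb100, z=1.96):
    """Hands required for the CI half-width to reach the target (bb/100)."""
    return math.ceil((z * 100 * sd_bb_per_hand / half_width_bb100) ** 2)


# Hypothetical per-hand standard deviation for naive Monte Carlo evaluation.
NAIVE_SD_BB = 10.0
# A 10x variance reduction divides the standard deviation by sqrt(10).
REDUCED_SD_BB = NAIVE_SD_BB / math.sqrt(10)

naive_hands = hands_needed(NAIVE_SD_BB, half_width_bb100=4.1)
reduced_hands = hands_needed(REDUCED_SD_BB, half_width_bb100=4.1)
```

Under these assumed numbers the ratio `naive_hands / reduced_hands` comes out close to 10. The absolute hand counts depend entirely on the assumed standard deviation and should not be read as figures from the paper.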

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical…

WHY NOW

Game AI moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Paper Pack

10.48550/arXiv.2603.23660

GTO Wizard Benchmark

A standardized benchmark and API to evaluate AI agents in complex poker games, revealing significant gaps in current LLM reasoning capabilities.

Abstract

Source availability

PDF linked

The paper record includes a public PDF URL.

Extraction status

Derived fallback

Read summaries are estimated from adjacent metadata, not verified extraction rows.

Proof status

unverified

0 refs; 0 sources; 17% coverage.

What was readable

linked, on file, not materialized, derived fallback, not indexed, not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability

7.0

Time to MVP

MVP estimate missing

Commercial

code

REFERENCES

Reference metadata is not materialized in the public index yet. The source PDF remains the authority; cache refresh is optional.

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

Prior Work: GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

7.0

Prior Work: GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

7.0

Prior Work: Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

7.0

Prior Work: YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Claim map

Abstract-backed public claims while anchored extraction refreshes.

Strong 0 | Mixed 0 | Weak 4

Evidence (partial): A standardized benchmark and API to evaluate AI agents in complex poker games, revealing significant gaps in current LLM reasoning capabilities. The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash Equilibria, and defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by $19.4 \pm 4.1$ bb/100.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Evidence (partial): We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold'em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash Equilibria, and defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by $19.4 \pm 4.1$ bb/100.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Evidence (partial): ScienceToStartup currently rates this 7.0/10 on the public viability pass. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. Code availability is flagged in the production record; the public repository link still needs proof alignment.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Evidence (partial): Game AI moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Payload preview

{
  "contract_version": "paper-r2",
  "paper_id": "63c156eb-c462-4c9a-a919-65c6bb6844c8",
  "arxiv_id": "2603.23660",
  "canonical_route": "/paper/gto-wizard-benchmark",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "gto-wizard-benchmark",
  "endpoints": {
    "paper_pack": "/api/v1/paper/gto-wizard-benchmark/paper-pack",
    "build_passport": "/api/v1/paper/gto-wizard-benchmark/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.

Paper proof surface

Canonical route: /paper/gto-wizard-benchmark

stale

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-04-02
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 17%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.

GTO Wizard Benchmark

Canonical ID gto-wizard-benchmark | Route /paper/gto-wizard-benchmark

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/gto-wizard-benchmark
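The same request can be sketched in Python using only the standard library. The URL pattern is taken from the curl line above; the helper names, the `Accept` header, the timeout, and the assumption that the endpoint returns a JSON object are illustrative, since the response shape is not documented on this page.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://sciencetostartup.com"


def handoff_url(paper_ref: str) -> str:
    """Build the agent-handoff URL for a paper slug such as 'gto-wizard-benchmark'."""
    slug = urllib.parse.quote(paper_ref, safe="")
    return f"{BASE_URL}/api/v1/agent-handoff/paper/{slug}"


def fetch_handoff(paper_ref: str, timeout: float = 10.0) -> dict:
    """Fetch the handoff payload.

    Assumes the endpoint answers with a JSON body; that is not confirmed here.
    """
    request = urllib.request.Request(
        handoff_url(paper_ref),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_handoff("gto-wizard-benchmark")` performs the network request; `handoff_url` can be used on its own to inspect the target URL first.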

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2603.23660"
  }
}

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "GTO Wizard Benchmark",
  "normalized_query": "2603.23660",
  "route": "/paper/gto-wizard-benchmark",
  "paper_ref": "gto-wizard-benchmark",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Watch and verify: GTO Wizard Benchmark

/buildability/gto-wizard-benchmark

Subject: GTO Wizard Benchmark

Verdict

Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.

Time to first demo

Insufficient data

No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.

Compute envelope

Structured compute envelope

Insufficient data

No data, compute, hardware, memory, latency, dependency, or serving requirement receipt is attached.

Source Proof anchors

Visual citations from the paper document graph.

JSON-LD twin

The application/ld+json payload rendered for agents.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/gto-wizard-benchmark#webpage",
      "url": "https://sciencetostartup.com/paper/gto-wizard-benchmark",
      "name": "GTO Wizard Benchmark",
      "description": "A standardized benchmark and API to evaluate AI agents in complex poker games, revealing significant gaps in current LLM reasoning capabilities.",
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/gto-wizard-benchmark#scholarlyArticle",
      "headline": "GTO Wizard Benchmark",
      "description": "A standardized benchmark and API to evaluate AI agents in complex poker games, revealing significant gaps in current LLM reasoning capabilities.",
      "url": "https://sciencetostartup.com/paper/gto-wizard-benchmark",
      "sameAs": "https://arxiv.org/abs/2603.23660",
      "identifier": {
        "@type": "PropertyValue",
        "propertyID": "arXiv",
        "value": "2603.23660"
      },
      "isAccessibleForFree": true,
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      },
      "datePublished": "2026-03-24T19:04:04.000Z",
      "author": [
        {
          "@type": "Person",
          "name": "Marc-Antoine Provost"
        },
        {
          "@type": "Person",
          "name": "Nejc Ilenic"
        },
        {
          "@type": "Person",
          "name": "Christopher Solinas"
        },
        {
          "@type": "Person",
          "name": "Philippe Beardsell"
        }
      ],
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "propertyID": "viabilityScore",
          "value": 7
        },
        {
          "@type": "PropertyValue",
          "propertyID": "researchDomain",
          "value": "Game AI"
        },
        {
          "@type": "PropertyValue",
          "propertyID": "commercialReadiness",
          "value": "code"
        }
      ]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://sciencetostartup.com"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Game AI",
          "item": "https://sciencetostartup.com/topics"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "GTO Wizard Benchmark",
          "item": "https://sciencetostartup.com/paper/gto-wizard-benchmark"
        }
      ]
    }
  ]
}
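A minimal sketch of how an agent might read this JSON-LD payload: it pulls the arXiv identifier, author names, and the `additionalProperty` values out of the `ScholarlyArticle` node. The embedded document is a trimmed copy of the payload above, and the helper names `nodes_of_type` and `summarize_article` are hypothetical.

```python
import json

# Trimmed copy of the JSON-LD payload shown above (WebPage and breadcrumb details omitted).
JSON_LD = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebPage", "name": "GTO Wizard Benchmark"},
    {
      "@type": "ScholarlyArticle",
      "headline": "GTO Wizard Benchmark",
      "sameAs": "https://arxiv.org/abs/2603.23660",
      "identifier": {"@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.23660"},
      "author": [
        {"@type": "Person", "name": "Marc-Antoine Provost"},
        {"@type": "Person", "name": "Nejc Ilenic"},
        {"@type": "Person", "name": "Christopher Solinas"},
        {"@type": "Person", "name": "Philippe Beardsell"}
      ],
      "additionalProperty": [
        {"@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7},
        {"@type": "PropertyValue", "propertyID": "researchDomain", "value": "Game AI"},
        {"@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code"}
      ]
    }
  ]
}
"""


def nodes_of_type(document: dict, type_name: str) -> list:
    """Return every node in the @graph array whose @type matches type_name."""
    return [node for node in document.get("@graph", []) if node.get("@type") == type_name]


def summarize_article(document: dict) -> dict:
    """Collect identifier, authors, and extra properties from the first ScholarlyArticle."""
    article = nodes_of_type(document, "ScholarlyArticle")[0]
    properties = {
        prop["propertyID"]: prop["value"]
        for prop in article.get("additionalProperty", [])
    }
    return {
        "arxiv_id": article["identifier"]["value"],
        "authors": [person["name"] for person in article["author"]],
        "properties": properties,
    }


summary = summarize_article(json.loads(JSON_LD))
```

Matching on `@type` rather than on array position keeps the lookup stable if the order of nodes in `@graph` changes.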

0/3 checks · 0%

References: 0 / 3 (3+ references)
Sources: 0 / 2 (2+ sources)
Coverage: 17% / 50% (50%+ coverage)
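The check summary can be reproduced with a small sketch, reading the three thresholds as at least 3 references, at least 2 sources, and at least 50% coverage. That reading of the thresholds, along with the names `ProofCounts`, `run_checks`, and `summary_line`, is an assumption for illustration and not the site's actual scoring code.

```python
from dataclasses import dataclass

# Thresholds as read from the check list above (an interpretation, not a documented contract).
MIN_REFERENCES = 3
MIN_SOURCES = 2
MIN_COVERAGE_PCT = 50.0


@dataclass(frozen=True)
class ProofCounts:
    references: int
    sources: int
    coverage_pct: float


def run_checks(counts: ProofCounts) -> dict:
    """Evaluate each threshold and return a pass/fail flag per check."""
    return {
        "references": counts.references >= MIN_REFERENCES,
        "sources": counts.sources >= MIN_SOURCES,
        "coverage": counts.coverage_pct >= MIN_COVERAGE_PCT,
    }


def summary_line(counts: ProofCounts) -> str:
    """Format the result in the same 'passed/total checks · percent' style as the page."""
    results = run_checks(counts)
    passed = sum(results.values())
    total = len(results)
    return f"{passed}/{total} checks · {round(100 * passed / total)}%"


current = ProofCounts(references=0, sources=0, coverage_pct=17.0)
line = summary_line(current)
```

With the values shown on this page (0 references, 0 sources, 17% coverage), every check fails, which is consistent with the "0/3 checks" line.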

Build Passport

UNVERIFIED

Build passport pending. Proof Lab budget: no verified cost estimate / $7.00 cap.

status: missing

reason: passport_row_missing

proof status: unverified

cost/budget: No verified cost estimate

confidence: low

next verification path

/api/v1/paper/2603.23660v1/paper-pack
/api/v1/paper/2603.23660v1/build-passport
Generate required assets: Dockerfile.minimal, RUN.sh, EXPECTED_OUTPUT.json, COST_PASSPORT.json

Build artifacts

Brief

missing

Build brief missing until Build Passport data exists.

Source missing: Build Passport payload.

Experiment plan

missing

Experiment plan missing until prototype path is available.

No prototype path attached.

Validation checklist

missing

Validation checklist missing until required assets, cost, and regulatory flags are verified.

No checklist artifact is attached to the Build Passport payload.

Signal Canvas

Derived signals show verified:false until source-backed receipts exist.

Open Signal Canvas

Signal: Evidence coverage
Source: OpportunityKernel evidence_receipt
Strength: 0 refs / 0 sources / 17% coverage
Freshness: stale
Owner action: Verify missing sources before using this as buyer proof. verified:false

Signal: Build readiness
Source: BuildPassport EvidenceState
Strength: passport absent
Freshness: stale
Owner action: Run Proof Lab or inspect typed missing state. verified:false

Signal: Artifact maturity
Source: GitHub and Hugging Face maturity payloads
Strength: No public artifact surface observed
Freshness: stale
Owner action: Open source artifacts or mark the gap as missing. verified:false

Viability breakdown

decision rows

Dimension | Current read | Evidence | Gaps | Next test

Technical feasibility

partial

Current read

Runnable path is not fully verified.

Evidence

No Build Passport payload attached.

Gaps

No verified reproduction transcript attached.

Next test

Run minimal reproduction from the Build Passport prototype path.

Market urgency

missing

Current read

Buyer urgency is not verified from source.

Talent buckets

no invented people

Scientific founder

Needed now

No named scientific founder assigned.

Paper authors are not treated as operators without consent.

People

No named person assigned.

Gaps

Founder commitment not verified.

Next verification path

Confirm founder/operator owner.

Translational engineer

Needed now

Prototype owner missing.

Build Passport does not name an implementer.

People

No named person assigned.

ARTIFACTS

No public artifacts yet.

DEFENSIBILITY

Defensibility and confidence evidence pending.
