ARXIV:2606.07515 · AI CAPABILITY EVALUATION · SUBMITTED 08 JUN · 17:14 UTC · FRESHNESS FRESH

[Verified] Source: PDF linked · [Verified] PaperPack: citation fields available · [Partial] Proof: unverified proof status

How reliable are LLMs when it comes to playing dice?

Luca Avena · Gianmarco Bet · Bernardo Busoni · arXiv

Explore the reliability of LLMs in probabilistic reasoning tasks through a novel benchmark dataset.

Ship in 2-4 weeks · Score 3.0 · Evidence unverified

Opportunity summary

Pain Explore the reliability of LLMs in probabilistic reasoning tasks through a novel benchmark dataset.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

Explore the reliability of LLMs in probabilistic reasoning tasks through a novel benchmark dataset. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic…

METHOD

Full abstract

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.
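The scoring behind figures like these can be sketched in a few lines. The block below is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names (`exact_probability`, `accuracy_by_group`, `accuracy_drop`), the record format, and the toy records are all hypothetical. It shows one way to obtain an exact ground truth for a discrete dice problem by enumeration, to average correctness per dataset and prompting mode, and to express a performance gap in both absolute and relative terms.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def exact_probability(event, sides=6, dice=2):
    """Exact probability of `event` by enumerating all rolls of fair dice."""
    outcomes = list(product(range(1, sides + 1), repeat=dice))
    favorable = sum(1 for roll in outcomes if event(roll))
    return Fraction(favorable, len(outcomes))


def accuracy_by_group(records):
    """Mean correctness per (dataset, cot) group.

    `records` is a hypothetical log format, one dict per graded answer:
    {"dataset": "standard" | "counterintuitive", "cot": bool, "correct": bool}
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for rec in records:
        group = (rec["dataset"], rec["cot"])
        counts[group][0] += int(rec["correct"])
        counts[group][1] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}


def accuracy_drop(baseline, perturbed):
    """Return (absolute, relative) drop from `baseline` to `perturbed` accuracy."""
    absolute = baseline - perturbed
    relative = absolute / baseline if baseline > 0 else float("nan")
    return absolute, relative


# Ground truth for one toy problem: P(sum of two fair dice == 7).
truth = exact_probability(lambda roll: sum(roll) == 7)

# Toy graded answers (made up for illustration, not data from the paper).
toy_records = [
    {"dataset": "standard", "cot": True, "correct": True},
    {"dataset": "standard", "cot": True, "correct": True},
    {"dataset": "counterintuitive", "cot": True, "correct": True},
    {"dataset": "counterintuitive", "cot": True, "correct": False},
]
scores = accuracy_by_group(toy_records)

# Applied to the two averages quoted in the abstract (0.96 and 0.59).
abs_gap, rel_gap = accuracy_drop(0.96, 0.59)
print(truth, scores, round(abs_gap, 2), round(rel_gap, 3))
```

For the two averages quoted in the abstract, the gap is 0.37 in absolute terms, or about 38.5% relative to the standard-problem accuracy. The abstract does not say whether the "over 20%" and "up to 34%" drops are absolute or relative, so the sketch reports both rather than assuming one.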

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. Code availability is flagged in the production record;…

WHY NOW

AI Capability Evaluation moved forward this cycle; last verified June 2026. Public score 3.0/10. Production flags indicate code availability.

Opportunity summary

Score 3.0

Pain Explore the reliability of LLMs in probabilistic reasoning tasks through a novel benchmark dataset.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment: AI Capability Evaluation

Adoption evidence: No public code link in the paper record yet

Commercial read: 3.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

References (12)

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs. Jasper Dekoninck, Nikola Jovanovic et al., 2026.

Mathematical exploration and discovery at scale. Bogdan Georgiev, Javier Gómez-Serrano et al., 2025.

BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs. Ivo Petrov, Jasper Dekoninck et al., 2025.

A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners. Bowen Jiang, Yangxinyu Xie et al., 2024.

(Ir)rationality and cognitive biases in large language models. Olivia Macmillan-Scott, Mirco Musolesi, 2024.

Towards Understanding Sycophancy in Language Models. Mrinank Sharma, Meg Tong et al., 2023.

Bias and Fairness in Large Language Models: A Survey. Isabel O. Gallegos, Ryan A. Rossi et al., 2023.

Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Thilo Hagendorff, Sarah Fabi et al., 2022.

Judgment Under Uncertainty: Heuristics and Biases. G. Shafer, Daniel Kahneman et al., 1984.

A Problem in Probability. D. Rudd, 1974.

Planar Point Sets with Many Unit Distances

Claude’s Cycles

Published: 2026-06-05 17:59 UTC · arXiv: https://arxiv.org/abs/2606.07515 · Free to access

Affiliation (all three authors): Università degli Studi di Firenze

Research domain: AI Capability Evaluation · Viability score: 3 · Commercial readiness: code

FAQ

What is the startup potential of "How reliable are LLMs when it comes to playing dice?"?

Explore the reliability of LLMs in probabilistic reasoning tasks through a novel benchmark dataset.

What products could be built from this research?

This research can be productized as a benchmarking tool or dataset offering for AI developers to test the reasoning capabilities of their language models, especially in probability and decision-making tasks.

What are the practical use cases?

A tool for testing the probabilistic reasoning capabilities of AI systems, which could be useful for researchers and developers seeking to deploy AI in applications involving decision-making under uncertainty.

What industries could this research disrupt?

The study challenges the assumption that LLMs are ready for tasks involving complex reasoning under uncertainty and suggests the need for more robust testing before deployment.
