ARXIV:2606.09450 · UNCATEGORIZED · SUBMITTED 09 JUN · 03:25 UTC · FRESHNESS FRESH

Verified · Source: PDF linked | Verified · Paper Pack: citation fields available | Partial · Proof: unverified proof status

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

QuocViet Pham · Elvir Karimov · Andrey Galichin · Ivan Oseledets · arXiv

ScienceToStartup currently rates this 0.0/10 on the public viability pass. LLMs have recently achieved strong results on formal proving benchmarks. Code availability is flagged in the production record; the public…

Ship in 2-4 weeks | Score 0.0 | Evidence unverified

Opportunity summary

Pain customer pain not on file

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

LLMs have recently achieved strong results on formal proving benchmarks.

METHOD

Full abstract

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.
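The abstract names theorem-level coverage and token-efficiency metrics but this page does not give their definitions. The sketch below is one plausible reading, not the authors' formulation: coverage is assumed to be the fraction of tasks proved within a theorem's family (the main theorem plus its subtheorems), and token efficiency is assumed to be the total generated tokens divided by the number of solved tasks. The `Attempt` record, its fields, and both function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One proving attempt on one task (hypothetical record layout)."""
    theorem: str   # name of the parent theorem
    task: str      # "main" or a subtheorem identifier
    proved: bool   # whether the proof checker accepted the proof
    tokens: int    # tokens the model generated for this attempt


def theorem_coverage(attempts):
    """Fraction of tasks proved within each theorem family (assumed definition)."""
    tasks = defaultdict(set)
    solved = defaultdict(set)
    for a in attempts:
        tasks[a.theorem].add(a.task)
        if a.proved:
            solved[a.theorem].add(a.task)
    return {t: len(solved[t]) / len(tasks[t]) for t in tasks}


def tokens_per_solved_task(attempts):
    """Total generated tokens divided by the number of distinct solved tasks.

    Returns None when nothing was solved, so "no proofs" is never
    mistaken for "free proofs".
    """
    total_tokens = sum(a.tokens for a in attempts)
    solved = {(a.theorem, a.task) for a in attempts if a.proved}
    return total_tokens / len(solved) if solved else None


# Illustrative data only; these are not results from the paper.
attempts = [
    Attempt("thm_a", "main", False, 4000),
    Attempt("thm_a", "sub_1", True, 500),
    Attempt("thm_a", "sub_2", True, 1500),
    Attempt("thm_b", "main", True, 2000),
]
coverage = theorem_coverage(attempts)
efficiency = tokens_per_solved_task(attempts)
print(coverage)
print(efficiency)
```

Under these assumptions, a prover that closes only the easy subtheorems of `thm_a` still earns partial coverage for that theorem, which is the kind of partial progress the premised version is described as exposing. Failed attempts count toward the token total, so long unsuccessful tactic traces lower efficiency.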

RESULT

WHY NOW

Uncategorized moved forward this cycle; last verified June 2026. Public score 0.0/10. Production flags indicate code availability.

Opportunity summary

Score 0.0

Pain customer pain not on file

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

No named competitor graph is public yet; the page still exposes the segment, adoption evidence, and score state so the commercial read is not blank.

Segment

Uncategorized

Adoption evidence

No public code link in the paper record yet

Commercial read

0.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

arXiv: https://arxiv.org/abs/2606.09450 · Published 2026-06-08 12:57 UTC · Free to access
