ARXIV:2606.03144 · LLM EVALUATION · SUBMITTED 03 JUN · 20:43 UTC · FRESHNESS FRESH

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Noujoud Nader · Ibrahem Aljabea · Patrick Diehl · Deepti Gupta · arXiv

A new benchmark for evaluating LLMs as mathematical research assistants in graph theory, revealing performance hierarchies and failure modes.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A new benchmark for evaluating LLMs as mathematical research assistants in graph theory, revealing performance hierarchies and failure modes.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

A new benchmark for evaluating LLMs as mathematical research assistants in graph theory, revealing performance hierarchies and failure modes. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph…

METHOD

Full abstract

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models (GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3) under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human-expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that "correct algorithm, wrong execution" errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete-reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
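The abstract reports agreement as kappa across human pairs but does not say which kappa statistic was used. As a rough illustration only, the sketch below assumes Cohen's kappa for two raters and binary accept/reject verdicts. The function name cohen_kappa and the verdict lists are hypothetical, and the numbers are made up rather than taken from GTBench.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items.

    Assumption: this is the statistic behind the reported kappa range;
    the abstract does not confirm it.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts on ten proofs (1 = accepted, 0 = rejected).
rater_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater_b = [1, 1, 1, 1, 0, 0, 1, 0, 0, 1]
print(round(cohen_kappa(rater_a, rater_b), 2))  # 0.58
```

In this made-up example the raters agree on 8 of 10 proofs, but chance agreement is already 0.52, so kappa is about 0.58. Raw percent agreement would overstate how consistent the two raters are, which is why a chance-corrected statistic is the more informative number when judging whether an automated judge can stand in for a human grader.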

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while…

WHY NOW

LLM Evaluation moved forward this cycle; last verified June 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain A new benchmark for evaluating LLMs as mathematical research assistants in graph theory, revealing performance hierarchies and failure modes.

Evidence 0 refs | 3 sources | 50% coverage

Blocker No shell-level blocker reported

Analysis summary

A new benchmark for evaluating LLMs as mathematical research assistants in graph theory, revealing performance hierarchies and failure modes.


Competitive landscape

A new benchmark for evaluating LLMs as mathematical research assistants in graph theory, revealing performance hierarchies and failure modes.

Segment

LLM Evaluation

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified



