ARXIV:2604.14140 · LLM EVALUATION · SUBMITTED 16 APR · 18:18 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

Sumeet Ramesh Motwani · Daniel Nichols · Charles London · Peggy Li · Fabio Pizzati · Acer Blake · +14 at arXiv

LongCoT is a benchmark for evaluating long-horizon chain-of-thought reasoning in LLMs. It reveals a significant gap in current model capabilities and provides a rigorous measure for future progress.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: Introducing LongCoT, a benchmark for evaluating long-horizon chain-of-thought reasoning in LLMs, revealing a significant gap in current model capabilities and providing a rigorous measure for future progress.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified

Full abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
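The problem structure described in the abstract (a short input, a graph of interdependent steps, and a single verifiable answer) can be sketched roughly as follows. This is a minimal illustration, not the LongCoT implementation: the `Step` and `Problem` classes, the toy arithmetic steps, and the exact-match scoring rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Step:
    """One node in a hypothetical dependency graph of reasoning steps."""
    name: str
    deps: List[str]          # names of steps whose results this step needs
    fn: Callable[..., int]   # computes this step's value from its deps' values


@dataclass
class Problem:
    """Hypothetical problem: a step graph plus a verifiable final answer."""
    steps: Dict[str, Step]
    final: str               # name of the step whose value is the answer
    answer: int              # ground-truth answer used for verification


def solve(problem: Problem) -> int:
    """Evaluate every step in dependency order and return the final value."""
    graph = {name: set(step.deps) for name, step in problem.steps.items()}
    values: Dict[str, int] = {}
    # static_order() yields each step only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        step = problem.steps[name]
        values[name] = step.fn(*(values[d] for d in step.deps))
    return values[problem.final]


def accuracy(predictions: List[int], problems: List[Problem]) -> float:
    """Exact-match accuracy over final answers (an assumed scoring rule)."""
    correct = sum(pred == prob.answer for pred, prob in zip(predictions, problems))
    return correct / len(problems)


# Toy instance: each step is trivial on its own, but the final answer
# is only right if every upstream step is right.
toy = Problem(
    steps={
        "a": Step("a", [], lambda: 3),
        "b": Step("b", ["a"], lambda a: a + 4),
        "c": Step("c", ["a", "b"], lambda a, b: a * b),
        "d": Step("d", ["c", "b"], lambda c, b: c - b),
    },
    final="d",
    answer=14,
)
```

In this toy setup, a model's response would be reduced to its final answer and compared against `answer`; a single wrong intermediate value propagates to the final step, which is the kind of dependency the abstract describes.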

RESULT

ScienceToStartup currently rates this paper 7.0/10 on the public viability pass. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities.
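One way to see how individually tractable steps can still produce low end-to-end accuracy is a simple compounding calculation. The sketch below assumes every step succeeds independently with the same probability; that is a simplification introduced here for illustration, not a claim or model from the paper, and the per-step rate and step counts are made-up values.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps
    that each succeed with probability p_step (an illustrative assumption)."""
    return p_step ** n_steps


def required_step_accuracy(target: float, n_steps: int) -> float:
    """Per-step success rate needed to reach `target` end-to-end success
    under the same independence assumption."""
    return target ** (1.0 / n_steps)


if __name__ == "__main__":
    # Illustrative values only: a 99% per-step success rate over longer chains.
    for n in (10, 100, 500):
        print(f"steps={n:4d}  end-to-end success={chain_success(0.99, n):.4f}")
    print(f"per-step rate for 50% over 100 steps: {required_step_accuracy(0.5, 100):.4f}")
```

Under this simplified model, end-to-end success decays geometrically with chain length, so even a high per-step rate leaves little margin over hundreds of dependent steps.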

WHY NOW

LLM Evaluation moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability, although the paper record lists no public code link yet.

Competitive landscape

Segment: LLM Evaluation

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-04-15 17:58 UTC · arXiv: https://arxiv.org/abs/2604.14140 · Free to access

Authors: Sumeet Ramesh Motwani · Daniel Nichols · Charles London · Peggy Li · Fabio Pizzati · Acer Blake · Hasan Hammoud · Tavish McDonald · Akshat Naik · Alesia Ivanova · Vignesh Baskaran · Ivan Laptev · Ruben Glatt · Tal Ben-Nun · Philip Torr · Natasha Jaques · Ameya Prabhu · Brian Bartoldson · Bhavya Kailkhura · Christian Schroeder de Witt
