ARXIV:2604.11996 · LLM EVALUATION · SUBMITTED 15 APR · 20:33 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: partial proof status

Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

Manas Pathak · Xingyao Chen · Shuozhe Li · Amy Zhang · Liu Leqi · arXiv

A new evaluation metric for LLMs that assesses reasoning quality beyond simple accuracy, with open-source code available.

Ship in 2-4 weeks · Score 7.0 · Evidence partial

Opportunity summary

Pain A new evaluation metric for LLMs that assesses reasoning quality beyond simple accuracy, with open-source code available.

Evidence 0 refs | 4 sources | 83% coverage

Blocker Evidence partial

PROBLEM

A new evaluation metric for LLMs that assesses reasoning quality beyond simple accuracy, with open-source code available. LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the…

METHOD

Full abstract

Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model's transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
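The aggregation step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The `Trace` type, the scalar `confidence` field, per-dimension scores in [0, 1], the unweighted mean over the four dimensions, and filtering over one pooled list of traces are all hypothetical choices made for the example; the abstract does not specify them.

```python
from dataclasses import dataclass
from math import ceil
from statistics import mean

# The four dimensions named in the abstract.
DIMENSIONS = ("faithfulness", "coherence", "utility", "factuality")


@dataclass
class Trace:
    """One sampled reasoning trace (hypothetical container for this sketch)."""
    confidence: float         # assumed: scalar confidence, higher = more confident
    scores: dict[str, float]  # assumed: one score in [0, 1] per dimension


def reasoning_score(trace: Trace) -> float:
    """Per-trace reasoning score; the unweighted mean is an assumption."""
    return mean(trace.scores[d] for d in DIMENSIONS)


def filtered_reasoning_score(traces: list[Trace], top_k_percent: float) -> float:
    """Average reasoning score over the top-K% most confident traces."""
    if not traces:
        raise ValueError("need at least one trace")
    if not 0 < top_k_percent <= 100:
        raise ValueError("top_k_percent must be in (0, 100]")
    # Keep at least one trace, rounding the cutoff up.
    keep = max(1, ceil(len(traces) * top_k_percent / 100))
    ranked = sorted(traces, key=lambda t: t.confidence, reverse=True)
    return mean(reasoning_score(t) for t in ranked[:keep])


def naive_reasoning_score(traces: list[Trace]) -> float:
    """Baseline the abstract argues against: average over all traces."""
    return filtered_reasoning_score(traces, 100)
```

With four traces whose confidences are 0.9, 0.8, 0.2, and 0.1 and whose reasoning scores are 1.0, 0.5, 0.0, and 0.0, keeping the top 50% gives 0.75, while the naive average over all four gives 0.375. The gap shows how low-confidence traces can pull the unfiltered score away from the model's most-confident reasoning.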

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. A public…

WHY NOW

LLM Evaluation moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score 7.0

Pain A new evaluation metric for LLMs that assesses reasoning quality beyond simple accuracy, with open-source code available.

Evidence 0 refs | 4 sources | 83% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment: LLM Evaluation
Adoption evidence: Public code linked for build inspection
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Page metadata: arXiv 2604.11996 (https://arxiv.org/abs/2604.11996) · published 2026-04-13 · free to access · code: https://github.com/Manas2006/benchmark_reproducibility
