ARXIV:2604.04791 · LLM EVALUATION · SUBMITTED 07 APR · 20:13 UTC · FRESHNESS UNKNOWN

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

Yuhang Liu · Heyan Huang · Yizhe Yang · Hongyan Zhao · Zhizhuo Zeng · Yang Gao · arXiv

A new framework to systematically evaluate LLMs on complex, real-world problem-solving tasks, revealing critical execution gaps.

Ship in 2-4 weeks · Score 4.0 · Evidence unverified

Opportunity summary

Pain: A new framework to systematically evaluate LLMs on complex, real-world problem-solving tasks, revealing critical execution gaps.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified

PROBLEM

LLMs perform strongly on reasoning benchmarks, but it remains unclear whether they can solve real-world problems that require end-to-end workflows. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability.

METHOD

Full abstract

Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.
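The stage-wise idea in the abstract can be illustrated with a small sketch. The stage names below come from the abstract; everything else is an assumption made for illustration: the 0-10 scale, the gap statistic (mean early-stage score minus mean execution-stage score), the use of Pearson correlation as the alignment measure, and all numeric values. None of it is taken from the paper's actual method or results.

```python
from statistics import mean

# Stage names taken from the abstract; the early/execution grouping follows its wording.
EARLY_STAGES = ("problem_identification", "formulation")
EXECUTION_STAGES = ("model_solving", "code_implementation", "result_analysis")


def stage_means(scores):
    """Average each stage's scores across problems (`scores` maps stage -> list of floats)."""
    return {stage: mean(values) for stage, values in scores.items()}


def comprehension_execution_gap(scores):
    """Mean early-stage score minus mean execution-stage score (hypothetical summary statistic)."""
    means = stage_means(scores)
    early = mean(means[s] for s in EARLY_STAGES)
    execution = mean(means[s] for s in EXECUTION_STAGES)
    return early - execution


def pearson(xs, ys):
    """Pearson correlation, one possible measure of automatic-vs-human score alignment."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores, invented for illustration only (not results from the paper).
example_scores = {
    "problem_identification": [8.0, 9.0],
    "formulation": [7.0, 8.0],
    "model_solving": [5.0, 4.0],
    "code_implementation": [4.0, 3.0],
    "result_analysis": [3.0, 4.0],
}
gap = comprehension_execution_gap(example_scores)

# Placeholder automatic and human scores for the same four hypothetical submissions.
alignment = pearson([8.0, 5.0, 3.0, 6.0], [7.0, 5.0, 4.0, 6.0])
print(f"gap={gap:.2f}, alignment={alignment:.2f}")
```

A positive gap under this definition would correspond to the pattern the abstract describes, where early stages score higher than execution-oriented ones. A rank-based measure such as Spearman correlation would be an equally plausible choice for the alignment check.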

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation,…

WHY NOW

LLM Evaluation moved forward this cycle; last verified April 2026. Public score 4.0/10. Production flags indicate code availability.

Competitive landscape

Segment: LLM Evaluation

Adoption evidence: No public code link in the paper record yet

Commercial read: 4.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-04-06T15:58:47.000Z · arXiv: https://arxiv.org/abs/2604.04791
