ARXIV:2604.15760 · LLM EVALUATION · SUBMITTED 20 APR · 20:23 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Ankit Maloo · arXiv

KWBench is a new benchmark for evaluating LLMs' unprompted problem recognition in knowledge work, revealing significant gaps in current models' capabilities.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: KWBench is a new benchmark for evaluating LLMs' unprompted problem recognition in knowledge work, revealing significant gaps in current models' capabilities.

Evidence: 0 refs | 5 sources | 67% coverage

Blocker: Evidence unverified

PROBLEM

KWBench is a new benchmark for evaluating LLMs' unprompted problem recognition in knowledge work, revealing significant gaps in current models' capabilities. Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to…

METHOD

Full abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
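The gating and routing arithmetic described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `RubricResult` shape, the tier weights, and every function name below are hypothetical, and the abstract does not say how the three tiers are combined or how "agreement" between two models is computed (intersection over union of passed tasks is assumed here).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Judged outcome for one model response on one task (hypothetical shape)."""
    mandatory: list[bool]     # one flag per mandatory criterion
    tier_scores: list[float]  # one score in [0, 1] per rubric tier


# Assumed weights; the abstract does not state how the three tiers combine.
TIER_WEIGHTS = (0.5, 0.3, 0.2)


def passes(result: RubricResult) -> bool:
    """Conjunctive gate: every mandatory criterion must hold."""
    return all(result.mandatory)


def gated_score(result: RubricResult) -> float:
    """Weighted tier score, forced to 0.0 when the mandatory gate fails."""
    if not passes(result):
        return 0.0
    return sum(w * s for w, s in zip(TIER_WEIGHTS, result.tier_scores))


def routing_coverage(pass_sets: dict[str, set[int]], n_tasks: int) -> float:
    """Fraction of tasks passed by at least one model (an oracle router)."""
    covered = set().union(*pass_sets.values())
    return len(covered) / n_tasks


def solved_by_exactly_one(pass_sets: dict[str, set[int]]) -> set[int]:
    """Task ids that appear in exactly one model's pass set."""
    counts = Counter(task for tasks in pass_sets.values() for task in tasks)
    return {task for task, n in counts.items() if n == 1}


def pass_overlap(a: set[int], b: set[int]) -> float:
    """Assumed agreement measure: shared passes over the union of passes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy pass sets over 10 tasks; these are not KWBench results.
    toy = {"model_a": {1, 2, 3}, "model_b": {3, 4}, "model_c": {5}}
    print(routing_coverage(toy, n_tasks=10))
    print(sorted(solved_by_exactly_one(toy)))
    print(pass_overlap(toy["model_a"], toy["model_b"]))
```

Because the gate zeroes the score on any mandatory failure, averaging `gated_score` over all tasks gives an unconditional score, while averaging only over tasks where `passes` is true gives the conditional score; that distinction is what the abstract draws when it says conditional scores converge and unconditional scores do not.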

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the…

WHY NOW

LLM Evaluation moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 7.0

Pain: KWBench is a new benchmark for evaluating LLMs' unprompted problem recognition in knowledge work, revealing significant gaps in current models' capabilities.

Evidence: 0 refs | 5 sources | 67% coverage

Blocker: no shell-level blocker reported

Competitive landscape

Segment: LLM Evaluation

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

arXiv: https://arxiv.org/abs/2604.15760

Published: 2026-04-17T07:04:54.000Z

Code repository: https://github.com/ankitmaloo/fasteval
