ARXIV:2605.23262 · LLM AGENTS · SUBMITTED 25 MAY · 20:37 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Design and Report Benchmarks for Knowledge Work

Yining Hua · Hongbin Na · Cyrus Ayubcha · Levi Lian · arXiv

A new benchmark design and reporting framework for knowledge work AI that better reflects real-world deployment settings and work product usability.

Ship in 2-4 weeks · Score 5.0 · Evidence unverified

Opportunity summary

Pain A new benchmark design and reporting framework for knowledge work AI that better reflects real-world deployment settings and work product usability.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

A new benchmark design and reporting framework for knowledge work AI that better reflects real-world deployment settings and work product usability. However, current knowledge-work evaluation and benchmark design still largely follow the logic of…

METHOD

Full abstract

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
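The three reporting steps named in the abstract (work activity, tested setting, scored work product) could be captured as a small record with a completeness check. The sketch below is illustrative only: the class names, field names, and the `reporting_gaps` helper are hypothetical and are not taken from the paper. The example entry fills in only what the abstract states about OfficeQA Pro (it is scored by final answers) and leaves every other field empty.

```python
from dataclasses import dataclass, field


@dataclass
class TestedSetting:
    """Step 2: the setting a benchmark tests. Field names are hypothetical."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """One reporting record per benchmark. An illustrative schema, not the paper's."""
    benchmark: str
    work_activity: str = ""                                        # Step 1
    setting: TestedSetting = field(default_factory=TestedSetting)  # Step 2
    scored_product: str = ""                                       # Step 3


def reporting_gaps(report: BenchmarkReport) -> list[str]:
    """Return the names of reporting fields that are still empty."""
    gaps = []
    if not report.work_activity:
        gaps.append("work_activity")
    for name in ("materials", "tools", "roles", "constraints"):
        if not getattr(report.setting, name):
            gaps.append(f"setting.{name}")
    if not report.scored_product:
        gaps.append("scored_product")
    return gaps


# Only the scored product is given in the abstract; the rest is left unset.
report = BenchmarkReport(benchmark="OfficeQA Pro", scored_product="final answer")
print(reporting_gaps(report))
```

Run as written, the check lists the work activity and all four setting fields as unreported for this entry, which is the kind of gap between a benchmarked task and its work claim that the abstract describes.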

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. Code availability…

WHY NOW

LLM Agents moved forward this cycle; last verified May 2026. Public score 5.0/10. Production flags indicate code availability.

Opportunity summary

Score 5.0

Pain A new benchmark design and reporting framework for knowledge work AI that better reflects real-world deployment settings and work product usability.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

A new benchmark design and reporting framework for knowledge work AI that better reflects real-world deployment settings and work product usability.

Segment: LLM Agents

Adoption evidence: No public code link in the paper record yet

Commercial read: 5.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified
