ARXIV:2605.20520 · AGENTS · SUBMITTED 21 MAY · 20:30 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Open-World Evaluations for Measuring Frontier AI Capabilities

Sayash Kapoor · Peter Kirgis · Andrew Schwartz · Stephan Rabanser · J. J. Allaire · Rishi Bommasani · +12 at arXiv

A project for conducting long-horizon, messy, real-world AI evaluations to provide early warnings of emerging capabilities.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A project for conducting long-horizon, messy, real-world AI evaluations to provide early warnings of emerging capabilities.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

Benchmark-based evaluation can both overstate and understate deployed capability because it privileges tasks that can be precisely specified,…

METHOD

Full abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.
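One way to picture the small-sample, qualitative reporting the abstract describes is a per-run record that tracks the task, whether it was completed, and each manual intervention. The sketch below is illustrative only: the class names, fields, and summary format are assumptions made for this example and are not taken from the paper or from CRUX.

```python
from dataclasses import dataclass, field


@dataclass
class Intervention:
    """One manual step taken by a human during a run (hypothetical schema)."""
    description: str
    avoidable: bool


@dataclass
class OpenWorldRun:
    """Record of a single long-horizon task attempt (hypothetical schema)."""
    task: str
    completed: bool
    interventions: list[Intervention] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)  # free-form qualitative observations

    def summary(self) -> str:
        # Count all interventions and the subset judged avoidable.
        total = len(self.interventions)
        avoidable = sum(1 for i in self.interventions if i.avoidable)
        status = "completed" if self.completed else "not completed"
        return f"{self.task}: {status}, {total} manual intervention(s), {avoidable} avoidable"


# Mirrors the outcome stated in the abstract; the intervention text is a
# placeholder because this summary does not describe what the intervention was.
run = OpenWorldRun(
    task="Develop and publish a simple iOS app to the Apple App Store",
    completed=True,
    interventions=[Intervention("not described in this summary", avoidable=True)],
)
print(run.summary())
```

Keeping interventions as explicit entries rather than a bare count leaves room for the qualitative detail that this style of evaluation depends on.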

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We conclude with recommendations for designing and reporting open-world evals. Code availability is flagged in the production record; the public repository link still needs…

WHY NOW

Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain A project for conducting long-horizon, messy, real-world AI evaluations to provide early warnings of emerging capabilities.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

A project for conducting long-horizon, messy, real-world AI evaluations to provide early warnings of emerging capabilities.


Competitive landscape

A project for conducting long-horizon, messy, real-world AI evaluations to provide early warnings of emerging capabilities.

Segment

Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


Published 2026-05-19T21:42:32.000Z · arXiv: https://arxiv.org/abs/2605.20520 · Free to access

Authors: Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan

