ARXIV:2604.11201 · UNIFIED AI AGENTS · SUBMITTED 14 APR · 16:49 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench Team · Shibo Hao · Zhining Zhang · Zhiqi Liang · Tianyang Liu · Yuheng Zha · +26 at arXiv

CocoaBench is a new benchmark and scaffold for evaluating unified LLM agents that combine vision, search, and coding capabilities on long-horizon tasks.

Ship in 2-4 weeks · Score 6.0 · Evidence unverified

Opportunity summary

Pain CocoaBench is a new benchmark and scaffold for evaluating unified LLM agents that combine vision, search, and coding capabilities on long-horizon tasks.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

CocoaBench is a new benchmark and scaffold for evaluating unified LLM agents that combine vision, search, and coding capabilities on long-horizon tasks. Yet, most evaluations still test these capabilities in isolation, which leaves a…

METHOD

Full abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
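The task format described in the abstract (an instruction plus an automatic evaluation function over the final output) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not CocoaBench's actual schema or the CocoaAgent API: the names `Task`, `success_rate`, `TOY_TASKS`, and `toy_agent` are hypothetical, and the two toy tasks are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction plus an automatic checker."""
    instruction: str
    evaluate: Callable[[str], bool]  # judges only the agent's final output


def success_rate(tasks: Iterable[Task], agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes its evaluation function."""
    results = [task.evaluate(agent(task.instruction)) for task in tasks]
    if not results:
        raise ValueError("no tasks to evaluate")
    return sum(results) / len(results)


# Toy tasks, invented for illustration (not taken from CocoaBench).
TOY_TASKS = [
    Task("Return the sum of 2 and 3.", lambda out: out.strip() == "5"),
    Task("Name the capital of France.", lambda out: out.strip().lower() == "paris"),
]


def toy_agent(instruction: str) -> str:
    """Stand-in agent; a real one would use vision, search, and code tools."""
    return "5" if "sum" in instruction else "London"


rate = success_rate(TOY_TASKS, toy_agent)
print(f"success rate: {rate:.1%}")  # prints "success rate: 50.0%"
```

Because the checker sees only the final output, the same task list can in principle be run against any agent that maps an instruction to an answer, which is one plausible reading of how the benchmark stays independent of agent infrastructure.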

RESULT

ScienceToStartup currently rates this 6.0/10 on the public viability pass. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Code availability is…

WHY NOW

Unified AI Agents moved forward this cycle; last verified April 2026. Public score 6.0/10. Production flags indicate code availability.

Opportunity summary

Score 6.0

Pain CocoaBench is a new benchmark and scaffold for evaluating unified LLM agents that combine vision, search, and coding capabilities on long-horizon tasks.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

CocoaBench is a new benchmark and scaffold for evaluating unified LLM agents that combine vision, search, and coding capabilities on long-horizon tasks.

Segment

Unified AI Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

6.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Authors

CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu

Date published

2026-04-13T09:00:10.000Z

arXiv

https://arxiv.org/abs/2604.11201
