ARXIV:2604.28139 · LLM AGENT BENCHMARKING · SUBMITTED 01 MAY · 15:04 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Chenxin Li · Zhengyang Tang · Huangxin Lin · Yunlong Lin · Shijue Huang · Shengyuan Liu · +5 at arXiv

A live benchmark for LLM agents that evaluates their ability to complete evolving real-world workflows with verifiable execution traces.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A live benchmark for LLM agents that evaluates their ability to complete evolving real-world workflows with verifiable execution traces.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

A live benchmark for LLM agents that evaluates their ability to complete evolving real-world workflows with verifiable execution traces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly…

METHOD

Full abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.
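The grading approach described in the abstract (deterministic checks when recorded evidence is sufficient, structured LLM judging only for semantic dimensions) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Evidence` container, the `grade_dimension` and `grade_task` helpers, the stub judge standing in for an LLM, and the assumed pass rule that every dimension must pass. The abstract does not specify the actual pass rule or grader interfaces.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Evidence:
    """Hypothetical record of what a run leaves behind (names are assumptions)."""
    trace: list[str] = field(default_factory=list)            # ordered tool calls
    audit_log: list[str] = field(default_factory=list)        # service-side events
    service_state: dict[str, str] = field(default_factory=dict)
    workspace_files: dict[str, str] = field(default_factory=dict)


# A deterministic check returns True/False, or None when evidence is insufficient.
Check = Callable[[Evidence], Optional[bool]]
# A judge decides a semantic dimension; here it stands in for a structured LLM judge.
Judge = Callable[[Evidence], bool]


def grade_dimension(ev: Evidence, check: Check, judge: Judge) -> tuple[bool, str]:
    """Prefer the deterministic verdict; fall back to the judge only if undecided."""
    verdict = check(ev)
    if verdict is not None:
        return verdict, "deterministic"
    return judge(ev), "judge"


def grade_task(ev: Evidence, dimensions: dict[str, tuple[Check, Judge]]) -> bool:
    """Assumed pass rule (not from the paper): every dimension must pass."""
    return all(grade_dimension(ev, c, j)[0] for c, j in dimensions.values())


def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0


# --- Illustrative dimensions (hypothetical task: close a ticket, write a summary) ---
def ticket_closed(ev: Evidence) -> Optional[bool]:
    status = ev.service_state.get("ticket")
    return None if status is None else status == "closed"


def summary_quality(ev: Evidence) -> Optional[bool]:
    return None  # semantic dimension: no deterministic verdict available


def stub_judge(ev: Evidence) -> bool:
    # Placeholder for an LLM judge: passes if a non-empty summary file exists.
    return bool(ev.workspace_files.get("summary.md", "").strip())


DIMENSIONS = {
    "state": (ticket_closed, stub_judge),
    "summary": (summary_quality, stub_judge),
}

run = Evidence(
    service_state={"ticket": "closed"},
    workspace_files={"summary.md": "Closed after refund."},
)
```

With these definitions, `grade_task(run, DIMENSIONS)` passes: the state dimension is settled deterministically and the summary dimension falls through to the judge. As an arithmetic aside, the reported 66.7% pass rate on 105 tasks is consistent with 70 passed tasks (70/105 ≈ 0.667), although the abstract does not state the count.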

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action. Code availability is flagged in the…

WHY NOW

LLM Agent Benchmarking moved forward this cycle; last verified May 2026. Public score 7.0/10. Production flags indicate code availability.

Competitive landscape

Segment: LLM Agent Benchmarking
Adoption evidence: No public code link in the paper record yet
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Paper record

arXiv: 2604.28139 (https://arxiv.org/abs/2604.28139)
Published: 2026-04-30 17:23:19 UTC
Authors: Chenxin Li · Zhengyang Tang · Huangxin Lin · Yunlong Lin · Shijue Huang · Shengyuan Liu · Bowen Ye · Rang Li · Lei Li · Benyou Wang · Yixuan Yuan
Research domain: LLM Agent Benchmarking
Viability score: 7
Commercial readiness: code
