ARXIV:2603.25226 · AGENTS · SUBMITTED 27 MAR · 20:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

Fanheng Kong · Jingyuan Zhang · Yang Yue · Chenxi Sun · Yang Tian · Shi Feng · +7 at arXiv

A benchmark and baseline framework for evaluating and improving end-to-end automated web testing agents, addressing critical gaps in test completeness and reliability for industrial deployment.

Ship in 2-4 weeks | Score 7.0 | Evidence unverified

Opportunity summary

Pain A benchmark and baseline framework for evaluating and improving end-to-end automated web testing agents, addressing critical gaps in test completeness and reliability for industrial deployment.

Evidence 0 refs | 0 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

A benchmark and baseline framework for evaluating and improving end-to-end automated web testing agents, addressing critical gaps in test completeness and reliability for industrial deployment. This paradigm has driven automated webpage development, but it…

METHOD

Full abstract

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
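The two cascaded sub-tasks named in the abstract, checklist generation followed by defect detection, can be pictured with a minimal sketch. This is not the WebTester implementation: every name here (`ChecklistItem`, `Verdict`, `run_cascade`, `find_defects`, and the two stub functions) is a hypothetical stand-in, and the stubs replace what would in practice be an LLM call and a browser-driving agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    """One testable expectation derived from the requirement (hypothetical type)."""
    description: str


@dataclass
class Verdict:
    """Outcome of checking one checklist item against the web app (hypothetical type)."""
    item: ChecklistItem
    passed: bool


def run_cascade(
    requirement: str,
    generate_checklist: Callable[[str], List[ChecklistItem]],
    check_item: Callable[[ChecklistItem], bool],
) -> List[Verdict]:
    # Stage 1: checklist generation from the natural-language requirement.
    checklist = generate_checklist(requirement)
    # Stage 2: defect detection, producing one verdict per checklist item.
    return [Verdict(item, check_item(item)) for item in checklist]


def find_defects(verdicts: List[Verdict]) -> List[str]:
    # A defect is any checklist item whose check did not pass.
    return [v.item.description for v in verdicts if not v.passed]


# Assumption: trivial stubs standing in for an LLM and a computer-use agent.
def stub_generate(requirement: str) -> List[ChecklistItem]:
    parts = (part.strip() for part in requirement.split(";"))
    return [ChecklistItem(part) for part in parts if part]


def stub_check(item: ChecklistItem) -> bool:
    # Pretend the app violates the latent constraint on cart totals.
    return "never negative" not in item.description


verdicts = run_cascade(
    "login works; cart total is never negative", stub_generate, stub_check
)
print(find_defects(verdicts))
```

Because the stages are cascaded, an expectation that stage 1 never writes down cannot be flagged in stage 2, which is one way to read the abstract's concern about insufficient test completeness.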

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Our dataset and code are available at https://github.com/friedrichor/WebTestBench. A public repository is linked, so build verification can inspect implementation evidence instead of treating the…

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.


Opportunity summary

Score 7.0

Pain A benchmark and baseline framework for evaluating and improving end-to-end automated web testing agents, addressing critical gaps in test completeness and reliability for industrial deployment.

Evidence 0 refs | 0 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

A benchmark and baseline framework for evaluating and improving end-to-end automated web testing agents, addressing critical gaps in test completeness and reliability for industrial deployment.


Competitive landscape

A benchmark and baseline framework for evaluating and improving end-to-end automated web testing agents, addressing critical gaps in test completeness and reliability for industrial deployment.

Segment: Agents

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Paper metadata

Published: 2026-03-26T09:27:29Z

arXiv: 2603.25226 (https://arxiv.org/abs/2603.25226)

Authors: Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai

Code repository: https://github.com/friedrichor/WebTestBench

Research domain: Agents

Viability score: 7

Commercial readiness: code, repo url

