Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-31
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 51
Source count: 3
Coverage: 67%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
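
The score dates listed above can be sanity-checked with a few lines of Python. This is a minimal sketch, not the site's implementation: the helper name `is_fresh` and the inclusive comparison at the boundary date are assumptions. The length of the proof freshness window is not stated on this page, so only the score window is computed.

```python
from datetime import date

# Dates copied from the "Score updated" and "Score fresh until" fields above.
SCORE_UPDATED = date(2026, 4, 2)
SCORE_FRESH_UNTIL = date(2026, 5, 2)


def is_fresh(today: date, fresh_until: date) -> bool:
    """Hypothetical check: treat the 'fresh until' date itself as still fresh (assumed)."""
    return today <= fresh_until


# Length of the score window implied by the two dates.
score_window_days = (SCORE_FRESH_UNTIL - SCORE_UPDATED).days

print(score_window_days)
print(is_fresh(date(2026, 4, 15), SCORE_FRESH_UNTIL))
print(is_fresh(date(2026, 5, 3), SCORE_FRESH_UNTIL))
```

The same comparison would apply to the proof data once its window length is known.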

Agent Handoff

Canonical ID: vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification | Route: /signal-canvas/vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification
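
The same GET request could be prepared from Python with the standard library. The sketch below only builds the request object and does not send it. The helper name `build_handoff_request` is hypothetical, and the `Accept: application/json` header is an assumption, since the response format is not documented on this page.

```python
from urllib.parse import quote
from urllib.request import Request

# Base path taken from the curl example above.
BASE = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def build_handoff_request(canonical_id: str) -> Request:
    """Build (but do not send) the agent-handoff GET request for a canonical ID."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    url = f"{BASE}/{quote(canonical_id, safe='')}"
    # Assumed header: the page does not say which content type is returned.
    return Request(url, headers={"Accept": "application/json"}, method="GET")


req = build_handoff_request(
    "vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification"
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would issue the call; how to parse the response is left open because its schema is not shown here.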

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification",
    "query_text": "Summarize Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification"
  }
}
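
The payload above can also be generated programmatically. This is a minimal sketch under stated assumptions: the helper name `build_paper_query` is hypothetical, and the "Summarize <title>" pattern for `query_text` is inferred from this single example. How the payload is delivered to an MCP server depends on the client and is not shown.

```python
import json


def build_paper_query(paper_ref: str, title: str) -> dict:
    """Assemble a search_signal_canvas tool call in paper mode."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            # Assumed pattern, mirroring the example payload above.
            "query_text": f"Summarize {title}",
        },
    }


payload = build_paper_query(
    "vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification",
    "Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification",
)
print(json.dumps(payload, indent=2))
```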

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification",
  "normalized_query": "2603.26648",
  "route": "/signal-canvas/vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification",
  "paper_ref": "vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 7

References: 51

Proof: Verification pending

Freshness state: computing

Source paper: Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

PDF: https://arxiv.org/pdf/2603.26648v1

Source count: 3

Coverage: 67%

Last proof check: 2026-03-31T20:30:20.275Z

Signal Canvas receipt window

Watch and verify: Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

/buildability/vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification

Subject: Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Verdict

Watch

The verdict is Watch: viability or proof quality is intermediate, so the paper should be re-evaluated before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 7 | Mixed: 0 | Weak: 0

Evidence (partial): To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static UI-to-code generation, interactive multi-page frontend reproduction, to long-horizon full-stack website development.
Implication (partial): This is a core definition of the benchmark presented in the abstract and introduction.
Verification: partial

Evidence (partial): The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases.
Implication (partial): Specific quantitative details about the benchmark's composition are provided in the abstract and introduction.
Verification: partial

Evidence (partial): To support flexible, thorough and reliable evaluation, we propose workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge.
Implication (partial): The abstract explicitly describes the verification paradigm and its components.
Verification: partial

Evidence (partial): We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
Implication (partial): The abstract and experimental results tables clearly indicate performance gaps and limitations of current models.
Verification: partial

Evidence (partial): At the node level, 218 of 250 nodes (87.2%) are correctly judged by the verifier relative to human annotations, indicating high fine-grained execution accuracy.
Implication (partial): This claim is supported by specific validation metrics for the GUI Agent Verifier.
Verification: partial

Evidence (partial): When examined at the level of individual test-case categories, Navigation & Routing and Authentication & Authorization are the most reliable capabilities across models, with Claude-Opus-4.5 and GPT-5 achieving consistently high pass scores.
Implication (partial): This is inferred from the performance tables and the text discussing specific functional categories.
Verification: partial

Evidence (partial): Finding 6: At the level of individual functional categories, agents exhibit systematic weaknesses in complex, state-dependent operations.
Implication (partial): This is supported by the performance breakdown by functional categories in the results.
Verification: partial
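
The node-level accuracy quoted in the claim map, and the claim tally in its header, can be checked with simple arithmetic. The sketch below uses only numbers that appear on this page; the variable names are illustrative.

```python
# Node-level agreement between the GUI agent verifier and human annotations.
correct_nodes, total_nodes = 218, 250
node_accuracy_pct = round(100 * correct_nodes / total_nodes, 1)

# Claim strength tally from the claim map header.
claim_tally = {"strong": 7, "mixed": 0, "weak": 0}
total_claims = sum(claim_tally.values())

print(node_accuracy_pct)  # 87.2, matching the quoted percentage
print(total_claims)       # 7, matching "Claims: 7" in the evidence receipt
```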

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
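
The visibility rule above can be expressed as a single predicate. This is a minimal sketch, not the site's actual logic: the `ProofReceipt` type and `panels_visible` function are hypothetical, and reading "clears 50% coverage" as coverage of at least 50% is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProofReceipt:
    verified: bool
    references: int
    sources: int
    coverage: float  # fraction between 0 and 1


def panels_visible(receipt: ProofReceipt) -> bool:
    """Return True only if every gating condition holds."""
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage >= 0.50  # assumed inclusive threshold
    )


# Values reported on this page: unverified, 51 references, 3 sources, 67% coverage.
current = ProofReceipt(verified=False, references=51, sources=3, coverage=0.67)
print(panels_visible(current))
```

With the values shown on this page, the numeric thresholds are all met and only the unverified proof status keeps the panels hidden.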
