GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/geditbench-v2-a-human-aligned-benchmark-for-general-image-editing

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-31
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 53
Source count: 3
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
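
The score dates listed above imply a fixed freshness window. The following is a minimal sketch of that arithmetic, not the site's actual logic: the function names `window_days` and `is_fresh` are hypothetical, and treating the end date as inclusive is an assumption. The page does not state how the proof freshness window itself is computed, so only the score dates are used.

```python
from datetime import date


def window_days(updated: date, fresh_until: date) -> int:
    """Length of the freshness window in whole days."""
    return (fresh_until - updated).days


def is_fresh(today: date, fresh_until: date) -> bool:
    """Assumption: a record stays fresh through the end date, inclusive."""
    return today <= fresh_until


# Values copied from the "Score updated" and "Score fresh until" fields.
score_updated = date(2026, 4, 2)
score_fresh_until = date(2026, 5, 2)

print(window_days(score_updated, score_fresh_until))  # 30
```

Under these assumptions the score window is 30 days long, and any check made after 2026-05-02 would report the score as no longer fresh.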

Agent Handoff

Canonical ID: geditbench-v2-a-human-aligned-benchmark-for-general-image-editing
Route: /signal-canvas/geditbench-v2-a-human-aligned-benchmark-for-general-image-editing

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/geditbench-v2-a-human-aligned-benchmark-for-general-image-editing
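
The same request can be issued from Python with the standard library. This is a minimal sketch: the helper names `handoff_url` and `fetch_handoff` are hypothetical, and because the response format is not documented on this page, the body is returned as raw bytes rather than parsed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def handoff_url(canonical_id: str) -> str:
    """Build the agent-handoff URL for a canonical ID."""
    return f"{BASE}/{quote(canonical_id, safe='')}"


def fetch_handoff(canonical_id: str, timeout: float = 10.0) -> bytes:
    """Fetch the handoff document; returns the unparsed response body."""
    with urlopen(handoff_url(canonical_id), timeout=timeout) as resp:
        return resp.read()


paper_id = "geditbench-v2-a-human-aligned-benchmark-for-general-image-editing"
print(handoff_url(paper_id))
```

Calling `fetch_handoff(paper_id)` performs the network request; it is left out of the sketch so the block runs offline.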

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "geditbench-v2-a-human-aligned-benchmark-for-general-image-editing",
    "query_text": "Summarize GEditBench v2: A Human-Aligned Benchmark for General Image Editing"
  }
}
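
The tool call above can also be assembled programmatically. The sketch below only builds and serializes the payload; the helper name `build_search_call` is hypothetical, and how the payload is delivered to an MCP server depends on the client in use, which this page does not specify.

```python
import json


def build_search_call(paper_ref: str, title: str) -> dict:
    """Build a search_signal_canvas call in paper mode, mirroring the example above."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


call = build_search_call(
    "geditbench-v2-a-human-aligned-benchmark-for-general-image-editing",
    "GEditBench v2: A Human-Aligned Benchmark for General Image Editing",
)
print(json.dumps(call, indent=2))
```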

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "GEditBench v2: A Human-Aligned Benchmark for General Image Editing",
  "normalized_query": "2603.28547",
  "route": "/signal-canvas/geditbench-v2-a-human-aligned-benchmark-for-general-image-editing",
  "paper_ref": "geditbench-v2-a-human-aligned-benchmark-for-general-image-editing",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 8

References: 53

Proof: Verification pending

Freshness state: computing

Source paper: GEditBench v2: A Human-Aligned Benchmark for General Image Editing

PDF: https://arxiv.org/pdf/2603.28547v1

Source count: 3

Coverage: 50%

Last proof check: 2026-03-31T20:17:13.653Z

Signal Canvas receipt window

Watch and verify: GEditBench v2: A Human-Aligned Benchmark for General Image Editing

/buildability/geditbench-v2-a-human-aligned-benchmark-for-general-image-editing

Subject: GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Verdict

Watch

The verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before any execution decision.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 8 | Mixed: 0 | Weak: 0

Evidence (partial): we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks.
Implication (partial): Explicitly stated in the abstract and repeated in the parsed sections with specific numbers.
Verification: partial

Evidence (partial): PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average.
Implication (partial): Directly stated in the abstract and page 1 with a performance claim.
Verification: partial

Evidence (partial): Qwen2.5-VL-7B underperforms GPT-4o-2024-11-20 by 8.41% on average under a four-image setting, with the gap expanding to 30.05% as the number of input images increases.
Implication (partial): Supported by a cited study with specific percentage gaps mentioned, though the exact source table is not fully quoted.
Verification: partial

Evidence (partial): VCReward-Bench 3.5K 21
Implication (partial): Explicitly stated with comparative numbers in a table format within the parsed text.
Verification: partial

Evidence (partial): Our protocol constructs preference pairs across three specialized pipelines: object- and human-centric pipelines for local editing, and a VLM-as-a-Judge approach for global tasks
Implication (partial): Described in the methodology section, though the details are summarized and not fully quoted.
Verification: partial

Evidence (partial): we utilize CLIP-based Earth Mover’s Distance (EMD) to ensure both low-level visual features and high-level semantic content remain unchanged. Conversely, within Ωedit, we utilize task-specific metrics to decouple identity preservation from the editing effects
Implication (partial): Technical details are provided in the parsed text, though the explanation is somewhat abbreviated.
Verification: partial

Evidence (partial): by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models
Implication (partial): Stated in the abstract as a key finding from the experiments.
Verification: partial

Evidence (partial): GEditBench v2’s taxonomy from three open-source datasets: Pico-Banana-400K, Nano-Consistency-150K, and UnicEdit-10M.
Implication (partial): Explicitly described in the data curation section, though the selection process is summarized.
Verification: partial

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
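
The visibility rule described above amounts to four conditions that must all hold. Here is a minimal sketch, assuming "clears 50% coverage" means coverage of at least 50% and that the verified state is represented by the string "verified"; the `Receipt` class and `panels_visible` function are hypothetical names, not part of any published API.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    proof_status: str   # e.g. "verified" or "unverified"
    references: int     # number of cited references
    source_count: int   # number of distinct sources
    coverage: float     # fraction in [0, 1]


def panels_visible(r: Receipt, min_refs: int = 3, min_sources: int = 2,
                   min_coverage: float = 0.5) -> bool:
    """All four conditions must hold before the gated panels are shown.

    Assumption: the coverage threshold is inclusive (>= 50%).
    """
    return (
        r.proof_status == "verified"
        and r.references >= min_refs
        and r.source_count >= min_sources
        and r.coverage >= min_coverage
    )


# Values reported on this page: 53 references, 3 sources, 50% coverage, unverified.
current = Receipt("unverified", references=53, source_count=3, coverage=0.5)
print(panels_visible(current))  # False: the proof receipt is not yet verified
```

With the values shown on this page, the reference, source, and coverage thresholds are met under the inclusive reading, so verification status is the only condition still blocking the panels.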
