CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-31
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 22
Source count: 4
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models | Route /signal-canvas/cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models",
    "query_text": "Summarize CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models"
  }
}
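The REST URL and the MCP arguments above both derive from the same canonical ID. A minimal Python sketch of that relationship follows. It assumes the URL layout shown in the curl example (base host, then /api/v1/agent-handoff, then the route). The helper names signal_canvas_route, handoff_url, and mcp_payload are hypothetical and not part of any published client, and the sketch only builds the request data without sending it.

```python
import json
from urllib.parse import quote

# Values copied from the page; the helpers below are illustrative only.
BASE_URL = "https://sciencetostartup.com"
CANONICAL_ID = (
    "cdh-bench-a-commonsense-driven-hallucination-benchmark-"
    "for-evaluating-visual-fidelity-in-vision-language-models"
)
PAPER_TITLE = (
    "CDH-Bench: A Commonsense-Driven Hallucination Benchmark for "
    "Evaluating Visual Fidelity in Vision-Language Models"
)


def signal_canvas_route(canonical_id: str) -> str:
    """Return the canonical route for a paper (assumed: /signal-canvas/<id>)."""
    return "/signal-canvas/" + quote(canonical_id, safe="")


def handoff_url(canonical_id: str) -> str:
    """Return the agent-handoff URL, mirroring the curl example."""
    return f"{BASE_URL}/api/v1/agent-handoff{signal_canvas_route(canonical_id)}"


def mcp_payload(canonical_id: str, title: str) -> dict:
    """Return the MCP tool call, mirroring the JSON example."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": canonical_id,
            "query_text": f"Summarize {title}",
        },
    }


if __name__ == "__main__":
    print(handoff_url(CANONICAL_ID))
    print(json.dumps(mcp_payload(CANONICAL_ID, PAPER_TITLE), indent=2))
```

Keeping the canonical ID as the single input means the route, the REST URL, and the MCP paper_ref cannot drift apart.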

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models",
  "normalized_query": "2603.27982",
  "route": "/signal-canvas/cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models",
  "paper_ref": "cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 8

References: 22

Proof: Verification pending

Freshness state: computing

Source paper: CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

PDF: https://arxiv.org/pdf/2603.27982v1

Source count: 4

Coverage: 50%

Last proof check: 2026-03-31T20:21:22.553Z

Signal Canvas receipt window

Watch and verify: CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

/buildability/cdh-bench-a-commonsense-driven-hallucination-benchmark-for-evaluating-visual-fidelity-in-vision-language-models

Subject: CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Verdict

Watch

The verdict is Watch because viability or proof quality is intermediate; re-evaluate before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 8 | Mixed: 0 | Weak: 0

Claim 1
Evidence (partial): Most benchmarks use commonsense-consistent imagery, so visual evidence and commonsense priors typically agree.
Implication (partial): Directly stated in the abstract and analysis as the motivation for creating CDH-Bench.
Verification: partial

Claim 2
Evidence (partial): To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence–commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies.
Implication (partial): Explicitly stated in the abstract and title as the core contribution of the paper.
Verification: partial

Claim 3
Evidence (partial): Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence–commonsense conflict.
Implication (partial): Directly stated in the abstract as a key finding from evaluating frontier models.
Verification: partial

Claim 4
Evidence (partial): This distinction, quantified through CCR, provides a sharper diagnostic signal than accuracy alone, where direct answer competition makes the interpretation most transparent.
Implication (partial): Strongly supported by the analysis describing CCR's purpose and advantage over standard accuracy.
Verification: partial

Claim 5
Evidence (partial): We construct 600 images, organized as 300 counterfactual and CS images... yielding 300×2×2 = 1,200 evaluated instances in total.
Implication (partial): Specific numeric details are provided in the analysis section.
Verification: partial

Claim 6
Evidence (partial): CF-Acc is the accuracy on counterfactual (CF) images, and is our primary measure of visual fidelity under conflict.
Implication (partial): Explicitly stated in the metrics description section.
Verification: partial

Claim 7
Evidence (partial): CDH matters most where anomalies matter: medical imaging, quality inspection, scientific discovery, and forensics.
Implication (partial): Directly stated in the analysis with specific domain examples.
Verification: partial

Claim 8
Evidence (partial): RPD also answers Q2, but in relative terms: of what the model can do when visual evidence and commonsense agree, how much is lost when they conflict?
Implication (partial): Supported by the description of RPD's purpose and calculation method.
Verification: partial
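
The claims above describe CF-Acc as accuracy on counterfactual images and RPD as the share of agreement-case performance that is lost under conflict. The sketch below shows one plausible reading of those two descriptions. The formula RPD = (CS-Acc - CF-Acc) / CS-Acc is an assumption inferred from the wording, not a definition quoted from the paper, and the name CS-Acc is assumed by analogy with CF-Acc. CCR is left out because this page does not define it. The per-instance results are invented toy data.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of instances answered correctly."""
    if not correct:
        raise ValueError("accuracy is undefined for an empty result list")
    return sum(correct) / len(correct)


def rpd(cs_acc: float, cf_acc: float) -> float:
    """Relative drop from CS accuracy to CF accuracy.

    Assumed formula: (cs_acc - cf_acc) / cs_acc. The page describes RPD
    only in words, so treat this as an illustration.
    """
    if cs_acc == 0:
        raise ValueError("relative drop is undefined when CS accuracy is zero")
    return (cs_acc - cf_acc) / cs_acc


# Toy data, NOT results from the paper: True means a correct answer.
cs_results = [True] * 9 + [False] * 1   # images where evidence and commonsense agree
cf_results = [True] * 6 + [False] * 4   # counterfactual images

cs_acc = accuracy(cs_results)
cf_acc = accuracy(cf_results)
drop = rpd(cs_acc, cf_acc)

if __name__ == "__main__":
    print(f"CS-Acc={cs_acc:.2f} CF-Acc={cf_acc:.2f} RPD={drop:.2f}")
```

Under this reading, a model with identical accuracy on both image types has an RPD of 0, and a model that fails every counterfactual image has an RPD of 1.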

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
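
The visibility rule above can be read as a single predicate over the proof receipt. The following sketch assumes that "at least" means greater than or equal and that "clears 50% coverage" also includes exactly 50%; the page does not say which, so that boundary is a guess. ProofReceipt and panels_visible are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProofReceipt:
    """Hypothetical container for the receipt fields named in the rule."""
    verified: bool
    references: int
    sources: int
    coverage: float  # fraction in [0, 1], so 50% is 0.50


def panels_visible(receipt: ProofReceipt) -> bool:
    """Return True when all four stated conditions hold.

    Assumption: coverage of exactly 50% counts as clearing the bar.
    """
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage >= 0.50
    )


# Values reported on this page: unverified, 22 references, 4 sources, 50% coverage.
this_page = ProofReceipt(verified=False, references=22, sources=4, coverage=0.50)

if __name__ == "__main__":
    print(panels_visible(this_page))
```

With the values shown on this page, the reference, source, and coverage conditions are met under these assumptions, so the unverified proof status is what keeps the panels hidden.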
