VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-04-03
Score updated: 2026-04-03
Score fresh until: 2026-05-03
References: 0
Source count: 0
Coverage: 33%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID: videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification
Route: /signal-canvas/videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification
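The same call can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the `Accept` header and the assumption that the endpoint returns JSON are not stated on this page, and the helper names `build_handoff_request` and `fetch_handoff` are hypothetical.

```python
import json
import urllib.request

# Base URL and paper reference copied from the curl example above.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"
PAPER_REF = (
    "videozerobench-probing-the-limits-of-video-mllms-"
    "with-spatio-temporal-evidence-verification"
)


def build_handoff_request(paper_ref: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the agent-handoff route."""
    return urllib.request.Request(
        f"{BASE_URL}/{paper_ref}",
        headers={"Accept": "application/json"},  # assumption: JSON response
        method="GET",
    )


def fetch_handoff(paper_ref: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the body as JSON (assumed format)."""
    request = build_handoff_request(paper_ref)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_handoff(PAPER_REF)` performs the network request; `build_handoff_request` is kept separate so the URL can be inspected without sending anything.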

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification",
    "query_text": "Summarize VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification"
  }
}
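An agent that issues this call for several papers might build the payload programmatically. The sketch below only reproduces the field names shown in the MCP example; the helper name `build_search_call` is hypothetical, and the "Summarize <title>" query pattern is inferred from the single example on this page.

```python
import json


def build_search_call(paper_ref: str, title: str) -> dict:
    """Return a search_signal_canvas payload shaped like the example above."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            # Assumption: the query text follows the "Summarize <title>" pattern.
            "query_text": f"Summarize {title}",
        },
    }


payload = build_search_call(
    "videozerobench-probing-the-limits-of-video-mllms-"
    "with-spatio-temporal-evidence-verification",
    "VideoZeroBench: Probing the Limits of Video MLLMs "
    "with Spatio-Temporal Evidence Verification",
)
print(json.dumps(payload, indent=2))
```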

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification",
  "normalized_query": "2604.01569",
  "route": "/signal-canvas/videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification",
  "paper_ref": "videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 8

References: Pending verification

Proof: Verification pending

Freshness state: computing

Source paper: VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

PDF: https://arxiv.org/pdf/2604.01569v1

Source count: Pending verification

Coverage: 33%

Last proof check: 2026-04-03T20:50:41.059Z

Signal Canvas receipt window

Watch and verify: VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

/buildability/videozerobench-probing-the-limits-of-video-mllms-with-spatio-temporal-evidence-verification

Subject: VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Verdict: Watch

The verdict is Watch because viability or proof quality is intermediate; the paper should be re-evaluated before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 8, Mixed: 0, Weak: 0

Claim 1
Evidence (partial): "current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning"
Implication (partial): Directly stated in the abstract as a critical limitation of current evaluations.
Verification: partial

Claim 2
Evidence (partial): "(2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions"
Implication (partial): Directly stated in the abstract as the second critical limitation of current evaluations.
Verification: partial

Claim 3
Evidence (partial): "we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains"
Implication (partial): Directly stated in the abstract with specific numbers and scope.
Verification: partial

Claim 4
Evidence (partial): "Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3)"
Implication (partial): Directly stated in the abstract with specific numeric result and model name.
Verification: partial

Claim 5
Evidence (partial): "No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5)"
Implication (partial): Directly stated in the abstract with specific numeric result and clear performance threshold.
Verification: partial

Claim 6
Evidence (partial): "with most failing to achieve any correct grounded predictions"
Implication (partial): Directly stated in the abstract as a consequence of the Level-5 evaluation.
Verification: partial

Claim 7
Evidence (partial): "These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning"
Implication (partial): Directly stated in the abstract as the main conclusion from the experimental results.
Verification: partial

Claim 8
Evidence (partial): "revealing that grounded video understanding remains a bottleneck for long-video QA"
Implication (partial): Directly stated in the abstract as the key finding from the benchmark results.
Verification: partial
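The gap between the Level-3 and Level-5 claims can be illustrated with a small scoring sketch. Level-3 counts a question as correct when the answer is right; Level-5 additionally requires accurate spatio-temporal localization. This page does not give the paper's localization criteria, so the IoU fields, the 0.5 thresholds, and the four toy predictions below are all assumptions made for illustration, not the benchmark's actual metric or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    answer_correct: bool   # was the final answer right?
    temporal_iou: float    # overlap with the annotated time span (assumed metric)
    spatial_iou: float     # overlap with the annotated region (assumed metric)


def answer_accuracy(preds: list[Prediction]) -> float:
    """Answer-only accuracy, in the spirit of the Level-3 setting."""
    if not preds:
        raise ValueError("no predictions to score")
    return sum(p.answer_correct for p in preds) / len(preds)


def grounded_accuracy(
    preds: list[Prediction],
    temporal_threshold: float = 0.5,  # hypothetical threshold
    spatial_threshold: float = 0.5,   # hypothetical threshold
) -> float:
    """Accuracy when the answer AND both localizations must be right."""
    if not preds:
        raise ValueError("no predictions to score")
    hits = sum(
        p.answer_correct
        and p.temporal_iou >= temporal_threshold
        and p.spatial_iou >= spatial_threshold
        for p in preds
    )
    return hits / len(preds)


# Toy data, invented for illustration only.
toy = [
    Prediction(True, 0.8, 0.7),   # right answer, well grounded
    Prediction(True, 0.2, 0.9),   # right answer, wrong moment
    Prediction(True, 0.6, 0.1),   # right answer, wrong region
    Prediction(False, 0.9, 0.9),  # good grounding, wrong answer
]

print(answer_accuracy(toy))    # 0.75
print(grounded_accuracy(toy))  # 0.25
```

On the toy data, three of four answers are correct but only one is also grounded in both time and space, which is the kind of drop the Level-5 claim describes.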

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
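The visibility rule above amounts to four conditions that must all hold. A minimal sketch follows, assuming "clears 50% coverage" means coverage of at least 50% (the page does not say whether the bound is inclusive); the `ProofReceipt` type and `panels_visible` function are hypothetical names, not part of any documented API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProofReceipt:
    verified: bool
    references: int
    sources: int
    coverage: float  # fraction in [0, 1], e.g. 0.33 for 33%


def panels_visible(receipt: ProofReceipt) -> bool:
    """Return True only when every gating condition from the rule is met."""
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage >= 0.50  # assumption: inclusive lower bound
    )


# Values reported on this page: unverified, 0 references, 0 sources, 33% coverage.
current = ProofReceipt(verified=False, references=0, sources=0, coverage=0.33)
print(panels_visible(current))  # False
```

With the values currently reported on this page, every condition fails, so the panels stay hidden.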
