CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-31
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 21
Source count: 4
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms | Route /signal-canvas/carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms
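
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not a documented client: the helper names `handoff_url` and `fetch_handoff` are hypothetical, and the page does not state the response format, so parsing the body as JSON is an assumption.

```python
import json
import urllib.request

# Base path and paper reference copied from the curl example above.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"
PAPER_REF = (
    "carv-a-diagnostic-benchmark-for-compositional-analogical-"
    "reasoning-in-multimodal-llms"
)


def handoff_url(paper_ref: str) -> str:
    """Build the agent-handoff URL for a paper reference (hypothetical helper)."""
    return f"{BASE_URL}/{paper_ref}"


def fetch_handoff(paper_ref: str, timeout: float = 10.0):
    """GET the handoff document.

    Assumption: the endpoint returns a JSON body. The page does not
    document the response format, so this may need adjusting.
    """
    with urllib.request.urlopen(handoff_url(paper_ref), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_handoff(PAPER_REF)` performs the network request; `handoff_url(PAPER_REF)` alone reproduces the URL passed to curl.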

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms",
    "query_text": "Summarize CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs"
  }
}
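
For scripted use, the payload above can be assembled programmatically. The sketch below only builds and serializes the same JSON structure; `build_paper_query` is a hypothetical helper, and how the payload is delivered to an MCP server depends on the client in use, which this page does not specify.

```python
import json


def build_paper_query(paper_ref: str, title: str) -> dict:
    """Assemble a search_signal_canvas call in paper mode (hypothetical helper).

    The field names mirror the MCP example above; nothing else is assumed
    about the tool's schema.
    """
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


payload = build_paper_query(
    "carv-a-diagnostic-benchmark-for-compositional-analogical-"
    "reasoning-in-multimodal-llms",
    "CARV: A Diagnostic Benchmark for Compositional Analogical "
    "Reasoning in Multimodal LLMs",
)
print(json.dumps(payload, indent=2))
```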

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs",
  "normalized_query": "2603.27958",
  "route": "/signal-canvas/carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms",
  "paper_ref": "carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 8

References: 21

Proof: Verification pending

Freshness state: computing

Source paper: CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

PDF: https://arxiv.org/pdf/2603.27958v1

Source count: 4

Coverage: 50%

Last proof check: 2026-03-31T20:20:41.206Z

Signal Canvas receipt window

Watch and verify: CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

/buildability/carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms

Subject: CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

Verdict

Watch

The verdict is Watch because viability or proof quality is intermediate; re-evaluate before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 8 | Mixed: 0 | Weak: 0

Evidence (partial): To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark.
Implication (partial): Explicitly stated in the abstract as a novel task and 'the first diagnostic benchmark'.
Verification: partial

Evidence (partial): Evaluation on the state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieving only 40.4% accuracy, far below human-level performance of 100%.
Implication (partial): Directly stated in the abstract with a specific numeric result from Table 2.
Verification: partial

Evidence (partial): For closed-source models, they mostly fail at the decomposition stage, suggesting that while these models can percept the changes, they struggle to abstract the changes into symbolic rules.
Implication (partial): Directly stated in the analysis section (Q1) with reference to Figure 3a.
Verification: partial

Evidence (partial): The performance drops as the number of atomic transformations (N) increases.
Implication (partial): Strongly supported by the analysis in Figure 6 and its caption.
Verification: partial

Evidence (partial): Our results indicate MLLMs generally struggle with the generalization ability, and get worse when switch from Shared Source to Different Source setting.
Implication (partial): Directly stated in the analysis (Q3) and supported by performance drops in Table 2.
Verification: partial

Evidence (partial): For open-source models, the challenge shifts into perception.
Implication (partial): Directly stated in the analysis (Q1), contrasting with the finding for closed-source models.
Verification: partial

Evidence (partial): For most models, combinations among subject, number, and position contribute most to the failure.
Implication (partial): Stated in the caption of Figure 4, though the evidence quote is from the surrounding text.
Verification: partial

Evidence (partial): Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence.
Implication (partial): Explicitly stated in the abstract as the motivation for the work.
Verification: partial

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
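
The visibility rule above can be expressed as a small predicate. This is a sketch under stated assumptions, not the site's actual logic: the `ProofReceipt` type and `panels_visible` function are hypothetical, and "clears 50% coverage" is read here as coverage of at least 50%, which the page does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProofReceipt:
    """Hypothetical container for the receipt fields shown on this page."""
    verified: bool
    references: int
    sources: int
    coverage: float  # fraction in [0, 1], e.g. 0.50 for 50%


def panels_visible(
    receipt: ProofReceipt,
    min_references: int = 3,
    min_sources: int = 2,
    min_coverage: float = 0.50,
) -> bool:
    """Return True only when all four stated conditions hold.

    Assumption: "clears 50% coverage" is treated as >= 50%.
    """
    return (
        receipt.verified
        and receipt.references >= min_references
        and receipt.sources >= min_sources
        and receipt.coverage >= min_coverage
    )


# Values from this page's receipt: unverified, 21 references, 4 sources, 50% coverage.
current = ProofReceipt(verified=False, references=21, sources=4, coverage=0.50)
print(panels_visible(current))
```

With the values on this page, the reference, source, and coverage thresholds are met under that reading, so the unverified proof status is the only condition keeping the panels hidden.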
