ARXIV:2603.27958 · MULTIMODAL LLM EVALUATION · SUBMITTED 31 MAR · 20:20 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

Yongkang Du · Xiaohan Zou · Minhao Cheng · Lu Lin · arXiv

A new benchmark and dataset to diagnose and improve compositional analogical reasoning in multimodal LLMs, revealing significant performance gaps compared to human capabilities.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A new benchmark and dataset to diagnose and improve compositional analogical reasoning in multimodal LLMs, revealing significant performance gaps compared to human capabilities.

Evidence: 21 refs | 4 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

A new benchmark and dataset to diagnose and improve compositional analogical reasoning in multimodal LLMs, revealing significant performance gaps compared to human capabilities. Existing evaluations of this ability in multimodal large language models (MLLMs)…

METHOD

Full abstract

Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation of state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieves only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.
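
The multi-pair setup described in the abstract can be illustrated with a toy sketch. It is not the paper's data format or evaluation code: the attribute dictionaries, the `extract_rule`, `compose`, `apply_rule`, and `accuracy` helpers, and the union-style composition are all assumptions made here for illustration, and CARV may define its composition operators differently. The sketch reads one symbolic rule from each example pair, merges the rules, applies the result to a query object, and scores a multiple-choice answer.

```python
from typing import Dict, List

# Hypothetical symbolic description of one image: attribute name -> value.
Obj = Dict[str, str]
# Hypothetical rule format: attribute name -> value it is changed to.
Rule = Dict[str, str]


def extract_rule(before: Obj, after: Obj) -> Rule:
    """Return the attributes whose values differ between the two images of a pair."""
    return {attr: after[attr] for attr in before if before[attr] != after[attr]}


def compose(rules: List[Rule]) -> Rule:
    """Merge the rules from several pairs, rejecting contradictory edits."""
    merged: Rule = {}
    for rule in rules:
        for attr, value in rule.items():
            if merged.get(attr, value) != value:
                raise ValueError(f"conflicting rules for attribute {attr!r}")
            merged[attr] = value
    return merged


def apply_rule(obj: Obj, rule: Rule) -> Obj:
    """Apply the composed transformation to a query object."""
    return {**obj, **rule}


def accuracy(predictions: List[int], answers: List[int]) -> float:
    """Fraction of items where the chosen candidate index equals the gold index."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Pair A changes only the color; pair B changes only the size (made-up examples).
pair_a = ({"shape": "circle", "color": "red", "size": "small"},
          {"shape": "circle", "color": "blue", "size": "small"})
pair_b = ({"shape": "square", "color": "green", "size": "small"},
          {"shape": "square", "color": "green", "size": "large"})

query = {"shape": "triangle", "color": "red", "size": "small"}
candidates = [
    {"shape": "triangle", "color": "blue", "size": "small"},   # only rule A applied
    {"shape": "triangle", "color": "blue", "size": "large"},   # both rules applied
    {"shape": "triangle", "color": "red", "size": "large"},    # only rule B applied
]

rule = compose([extract_rule(*pair_a), extract_rule(*pair_b)])
choice = candidates.index(apply_rule(query, rule))
print(rule, choice)
```

In this toy the inputs are already symbolic, so the two steps are trivial. The first failure mode the abstract reports, decomposing visual changes into symbolic rules, corresponds to the part this sketch skips entirely: producing the attribute dictionaries from images.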

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Diagnostic analysis shows two consistent failure modes: (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting…

WHY NOW

Multimodal LLM Evaluation moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score: 7.0

Pain: A new benchmark and dataset to diagnose and improve compositional analogical reasoning in multimodal LLMs, revealing significant performance gaps compared to human capabilities.

Evidence: 21 refs | 4 sources | 50% coverage

Blocker: no shell-level blocker reported

Analysis summary

A new benchmark and dataset to diagnose and improve compositional analogical reasoning in multimodal LLMs, revealing significant performance gaps compared to human capabilities.

Competitive landscape

A new benchmark and dataset to diagnose and improve compositional analogical reasoning in multimodal LLMs, revealing significant performance gaps compared to human capabilities.

Segment: Multimodal LLM Evaluation

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Page: https://sciencetostartup.com/paper/carv-a-diagnostic-benchmark-for-compositional-analogical-reasoning-in-multimodal-llms

arXiv: https://arxiv.org/abs/2603.27958

Published: 2026-03-30 02:22 UTC

Commercial readiness: code
