ARXIV:2603.28211 · INTERPRETABLE VISION-LANGUAGE MODELS · SUBMITTED 31 MAR · 20:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Verified · PaperPack: citation fields available
Partial · Proof: unverified proof status

Explaining CLIP Zero-shot Predictions Through Concepts

Onat Ozdemir · Anders Christensen · Stephan Alaniz · Zeynep Akata · Emre Akbas · arXiv

Explain CLIP's zero-shot image recognition predictions using human-understandable concepts without additional supervision.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain Explain CLIP's zero-shot image recognition predictions using human-understandable concepts without additional supervision.

Evidence 49 refs | 4 sources | 83% coverage

Blocker Evidence unverified

PROBLEM

CLIP achieves strong zero-shot image recognition, yet its predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack…

METHOD

Full abstract

Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
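The projection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (the linked repository is the reference for that). It assumes a single linear projection matrix `W`, random unit vectors standing in for CLIP image and text embeddings, and simple mean-squared-error forms for the alignment and reconstruction terms, none of which are specified in the abstract. All function names (`project_to_concepts`, `reconstruct`, `alignment_loss`, `reconstruction_loss`, `concept_contributions`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Scale each row to unit length, as is conventional for CLIP embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Random unit vectors stand in for real CLIP embeddings (assumption).
dim, n_concepts, n_classes, n_images = 32, 8, 5, 4
image_emb = l2_normalize(rng.normal(size=(n_images, dim)))
class_emb = l2_normalize(rng.normal(size=(n_classes, dim)))
concept_emb = l2_normalize(rng.normal(size=(n_concepts, dim)))

# Hypothetical linear projection, initialised from the concept text embeddings.
W = concept_emb.copy()  # shape: (n_concepts, dim)

def project_to_concepts(x, W):
    """Map embeddings of shape (n, dim) to concept activations of shape (n, n_concepts)."""
    return x @ W.T

def reconstruct(c, W):
    """Least-squares decode of concept activations back to embedding space."""
    return c @ np.linalg.pinv(W).T

def reconstruction_loss(x, W):
    """Mean squared error between embeddings and their round trip through concept space."""
    return float(np.mean((x - reconstruct(project_to_concepts(x, W), W)) ** 2))

def alignment_loss(x, t, W):
    """Mean squared gap between CLIP similarities and concept-space similarities."""
    clip_logits = x @ t.T
    concept_logits = project_to_concepts(x, W) @ project_to_concepts(t, W).T
    return float(np.mean((clip_logits - concept_logits) ** 2))

def concept_contributions(x_i, t_j, W):
    """Per-concept terms whose sum is the concept-space score of image i for class j."""
    return project_to_concepts(x_i, W) * project_to_concepts(t_j, W)

contrib = concept_contributions(image_emb[0], class_emb[2], W)
top_concepts = np.argsort(-np.abs(contrib))[:3]  # indices of the most influential concepts
print("alignment loss:", alignment_loss(image_emb, class_emb, W))
print("reconstruction loss:", reconstruction_loss(image_emb, W))
print("top concept indices:", top_concepts)
```

In a trained variant, `W` would be optimised against some weighted combination of the two losses; the weighting and the exact loss forms here are placeholders. The per-concept contributions are what make the prediction readable: each entry says how much one named concept pushed the image toward the class.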

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing…

WHY NOW

Interpretable Vision-Language Models moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score 7.0

Pain Explain CLIP's zero-shot image recognition predictions using human-understandable concepts without additional supervision.

Evidence 49 refs | 4 sources | 83% coverage

Blocker no shell-level blocker reported

Competitive landscape

Explain CLIP's zero-shot image recognition predictions using human-understandable concepts without additional supervision.

Segment

Interpretable Vision-Language Models

Adoption evidence

Public code linked for build inspection

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/explaining-clip-zero-shot-predictions-through-concepts#webpage", "url": "https://sciencetostartup.com/paper/explaining-clip-zero-shot-predictions-through-concepts", "name": "Explaining CLIP Zero-shot Predictions Through Concepts", "description": "Explain CLIP's zero-shot image recognition predictions using human-understandable concepts without additional supervision.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/explaining-clip-zero-shot-predictions-through-concepts#scholarlyArticle", "headline": "Explaining CLIP Zero-shot Predictions Through Concepts", "description": "Explain CLIP's zero-shot image recognition predictions using human-understandable concepts without additional supervision.", "url": "https://sciencetostartup.com/paper/explaining-clip-zero-shot-predictions-through-concepts", "sameAs": "https://arxiv.org/abs/2603.28211", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.28211" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-30T09:31:33.000Z", "author": [ { "@type": "Person", "name": "Onat Ozdemir" }, { "@type": "Person", "name": "Anders Christensen" }, { "@type": "Person", "name": "Stephan Alaniz" }, { "@type": "Person", "name": "Zeynep Akata" }, { "@type": "Person", "name": "Emre Akbas" } ], "codeRepository": "https://github.com/oonat/ezpc", "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Interpretable Vision-Language Models" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code, repo url" } ] }, { "@type": "SoftwareSourceCode", "@id": 
"https://sciencetostartup.com/paper/explaining-clip-zero-shot-predictions-through-concepts#software", "name": "Explaining CLIP Zero-shot Predictions Through Concepts - Source Code", "description": "Explain CLIP's zero-shot image recognition predictions using human-understandable concepts without additional supervision.", "codeRepository": "https://github.com/oonat/ezpc", "url": "https://github.com/oonat/ezpc" }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Interpretable Vision-Language Models", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "Explaining CLIP Zero-shot Predictions Through Concepts", "item": "https://sciencetostartup.com/paper/explaining-clip-zero-shot-predictions-through-concepts" } ] } ] }
