ARXIV:2603.26486 · VISION-LANGUAGE MODELS · SUBMITTED 30 MAR · 21:52 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better

Mriganka Nath · Anurag Das · Jiahao Xie · Bernt Schiele · arXiv

A method to adapt large vision-language models on the fly to reduce hallucinations caused by visual input corruption, using CLIP as a guidance signal.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A method to adapt large vision-language models on the fly to reduce hallucinations caused by visual input corruption, using CLIP as a guidance signal.

Evidence 72 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

A method to adapt large vision-language models on the fly to reduce hallucinations caused by visual input corruption, using CLIP as a guidance signal. We show that such corruptions act as additional distribution shifts,…

METHOD

Full abstract

Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
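The abstract says CLIP's image-text alignment is used to identify reliable self-supervision targets, but it does not spell out the selection rule. The sketch below shows one plausible reading, as an assumption rather than the authors' procedure: score several candidate captions against the (possibly corrupted) image by cosine similarity and keep the best-aligned one as the target. The function names `cosine` and `select_target` are hypothetical, and the short vectors stand in for embeddings that a real pipeline would obtain from CLIP's image and text encoders.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_target(
    image_emb: Sequence[float],
    caption_embs: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the caption best aligned with the image, plus all scores.

    Assumption: the highest image-text similarity marks the most reliable
    candidate. The paper may use a different criterion.
    """
    scores = [cosine(image_emb, c) for c in caption_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy stand-ins for CLIP embeddings (hypothetical values, not real model outputs).
image_emb = [1.0, 0.0, 0.0]
caption_embs = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [0.9, 0.1, 0.0],  # well-aligned caption
    [0.5, 0.5, 0.0],  # partially aligned caption
]

best_index, scores = select_target(image_emb, caption_embs)
print(best_index, [round(s, 3) for s in scores])
```

In a test-time training loop, the selected caption would then serve as the supervision target for a small adaptation step on the single test sample. How that step is parameterized while leaving the base LVLM unaltered is not described in this record.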

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. Code availability is flagged in the production…

WHY NOW

Vision-Language Models moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Competitive landscape

Segment: Vision-Language Models
Adoption evidence: No public code link in the paper record yet
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

arXiv: https://arxiv.org/abs/2603.26486 · Published: 2026-03-27 14:47 UTC
