ARXIV:2605.23482 · VISION-LANGUAGE DATASET DISTILLATION · SUBMITTED 25 MAY · 20:33 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Multimodal Distribution Matching for Vision-Language Dataset Distillation

Jongoh Jeong · Hoyong Kwon · Minseok Kim · Kuk-Jin Yoon · arXiv

A geometry-aware framework for efficient multimodal dataset distillation that preserves cross-modal alignment and reduces computational cost.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A geometry-aware framework for efficient multimodal dataset distillation that preserves cross-modal alignment and reduces computational cost.

Evidence: 0 refs | 4 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

A geometry-aware framework for efficient multimodal dataset distillation that preserves cross-modal alignment and reduces computational cost. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment…

METHOD

Full abstract

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
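
The model-level and loss-level components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the mixing rule in mix_teachers (coefficients proportional to each model's angle from the anchor), the CLIP-style form of symmetric_contrastive_loss, the temperature of 0.07, and all function names are assumptions made for illustration. Weights are treated as flat vectors, and the cluster-based initialization and the geometry-aware distribution-matching term are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a point on the unit hypersphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angle_between(u, v):
    """Angle in radians between two flattened weight vectors."""
    cos = sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding


def mix_teachers(anchor, finetuned):
    """Interpolate fine-tuned weight vectors into one mixed teacher.

    ASSUMPTION: coefficients are proportional to each model's angular
    deviation from the pretrained anchor. The paper's rule may differ.
    """
    angles = [angle_between(anchor, w) for w in finetuned]
    total = sum(angles)
    if total == 0.0:  # every model points in the anchor's direction
        coeffs = [1.0 / len(finetuned)] * len(finetuned)
    else:
        coeffs = [a / total for a in angles]
    mixed = [sum(c * w[i] for c, w in zip(coeffs, finetuned))
             for i in range(len(anchor))]
    return mixed, coeffs


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss averaged over image-to-text and text-to-image.

    ASSUMPTION: the exact loss form and temperature are illustrative.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_t))


# Toy usage with made-up 2-D "weights" and embeddings.
anchor = [1.0, 0.0]
finetuned = [[1.0, 0.1], [1.0, 1.0]]
mixed, coeffs = mix_teachers(anchor, finetuned)

pairs = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_contrastive_loss(pairs, pairs)
loss_shuffled = symmetric_contrastive_loss(pairs, pairs[::-1])
```

In this toy run, the fine-tuned model that deviates more from the anchor receives the larger mixing coefficient, and correctly matched image-text pairs yield a lower loss than shuffled pairs. A real pipeline would operate on full model state and batched tensors rather than Python lists.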

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across…

WHY NOW

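Vision-Language Dataset Distillation moved forward this cycle; last verified May 2026. Public score 7.0/10. A repository is linked (https://github.com/cvpr-org/author-kit), but it is the CVPR LaTeX author kit rather than an MDM implementation, so implementation evidence remains unconfirmed.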

Competitive landscape

Segment: Vision-Language Dataset Distillation

Adoption evidence: A repository is linked (https://github.com/cvpr-org/author-kit), but it is the CVPR LaTeX author kit, not code for this paper

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

arXiv: https://arxiv.org/abs/2605.23482 · Published: 2026-05-22T10:41:58.000Z · Free to access · Linked repository: https://github.com/cvpr-org/author-kit
