ARXIV:2605.11803 · VIDEO LLMS · SUBMITTED 13 MAY · 20:36 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

Minseok Kang · Minhyeok Lee · Jungho Lee · Minjung Kim · Donghyeong Kim · Dayeon Lee · Heeseung Choi · Ig-jae Kim · Sangyoun Lee

OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.

Evidence 0 refs | 4 sources | 83% coverage

Blocker Evidence unverified

PROBLEM

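As Video-LLMs scale to longer and more complex videos, inference cost grows rapidly because visual tokens accumulate across frames. Training-free token compression has emerged as a practical solution to this bottleneck, but existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics. They overlook each token's semantic role within its frame and do not adapt compression strength to the compressibility of each frame pair.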

METHOD

Full abstract

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
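The transport step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes entropic-regularized OT solved with Sinkhorn iterations, a cost that adds cosine feature distance to a weighted normalized spatial distance, token mass proportional to a given importance score, and a budget rule that gives harder-to-transport frame pairs proportionally more tokens. The names `locality_aware_cost`, `sinkhorn_plan`, `transport_difficulty`, and `allocate_budgets`, and the parameters `lam`, `eps`, and `min_per_pair`, are hypothetical.

```python
import numpy as np


def locality_aware_cost(feat_a, feat_b, pos_a, pos_b, lam=0.5):
    """Assumed cost: cosine feature distance plus lam * normalized spatial distance."""
    fa = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    feature_cost = 1.0 - fa @ fb.T  # shape (n_a, n_b), values in [0, 2]
    spatial = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    spatial = spatial / (spatial.max() + 1e-8)  # scale to [0, 1]
    return feature_cost + lam * spatial


def sinkhorn_plan(cost, mass_a, mass_b, eps=0.1, n_iters=200):
    """Entropic-regularized OT via plain Sinkhorn iterations (an assumed solver)."""
    kernel = np.exp(-cost / eps)
    v = np.ones_like(mass_b)
    for _ in range(n_iters):
        u = mass_a / (kernel @ v)    # rescale rows toward mass_a
        v = mass_b / (kernel.T @ u)  # rescale columns toward mass_b
    return u[:, None] * kernel * v[None, :]


def transport_difficulty(feat_a, feat_b, pos_a, pos_b, imp_a, imp_b):
    """Return (total transport cost, plan) for one pair of neighboring frames."""
    # Non-uniform mass: tokens with higher importance carry more mass.
    mass_a = imp_a / imp_a.sum()
    mass_b = imp_b / imp_b.sum()
    cost = locality_aware_cost(feat_a, feat_b, pos_a, pos_b)
    plan = sinkhorn_plan(cost, mass_a, mass_b)
    return float((plan * cost).sum()), plan


def allocate_budgets(difficulties, total_tokens, min_per_pair=1):
    """Split a token budget across frame pairs in proportion to difficulty.

    Assumption: pairs that are harder to transport are less compressible,
    so they keep more tokens. Requires total_tokens >= min_per_pair * n_pairs.
    """
    d = np.asarray(difficulties, dtype=float)
    weights = d / d.sum() if d.sum() > 0 else np.full(len(d), 1.0 / len(d))
    extra = total_tokens - min_per_pair * len(d)
    budgets = min_per_pair + np.floor(weights * extra).astype(int)
    # Hand out tokens lost to rounding, hardest pairs first.
    remainder = total_tokens - budgets.sum()
    budgets[np.argsort(-d)[:remainder]] += 1
    return budgets
```

Calling `transport_difficulty` on each pair of neighboring frames and passing the resulting list to `allocate_budgets` yields one token budget per pair. The direction of the rule (harder pairs keep more tokens) is this sketch's reading of the abstract, not a confirmed detail of the paper.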

RESULT

Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens. ScienceToStartup currently rates the paper 8.0/10 on its public viability pass.

WHY NOW

Video LLMs moved forward this cycle; last verified May 2026. Public score 8.0/10. Implementation evidence is present through a linked repository (https://github.com/minseokii/OTT-Vid).

Opportunity summary

Score 8.0

Pain OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.

Evidence 0 refs | 4 sources | 83% coverage

Blocker no shell-level blocker reported

Analysis summary

OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.

Competitive landscape

OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.

Segment: Video LLMs

Adoption evidence: Public code linked for build inspection

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/ott-vid-optimal-transport-temporal-token-compression-for-video-large-language-models#webpage", "url": "https://sciencetostartup.com/paper/ott-vid-optimal-transport-temporal-token-compression-for-video-large-language-models", "name": "OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models", "description": "OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/ott-vid-optimal-transport-temporal-token-compression-for-video-large-language-models#scholarlyArticle", "headline": "OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models", "description": "OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.", "url": "https://sciencetostartup.com/paper/ott-vid-optimal-transport-temporal-token-compression-for-video-large-language-models", "sameAs": "https://arxiv.org/abs/2605.11803", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2605.11803" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-05-12T08:58:49.000Z", "author": [ { "@type": "Person", "name": "Minseok Kang" }, { "@type": "Person", "name": "Minhyeok Lee" }, { "@type": "Person", "name": "Jungho Lee" }, { "@type": "Person", "name": "Minjung Kim" }, { "@type": "Person", "name": "Donghyeong Kim" }, { "@type": "Person", "name": "Dayeon Lee" }, { "@type": "Person", "name": "Heeseung Choi" }, { "@type": "Person", "name": "Ig-jae Kim" }, { "@type": "Person", "name": "Sangyoun Lee" } ], "codeRepository": "https://github.com/minseokii/OTT-Vid", "additionalProperty": [ { "@type": 
"PropertyValue", "propertyID": "viabilityScore", "value": 8 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Video LLMs" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code, repo url" } ] }, { "@type": "SoftwareSourceCode", "@id": "https://sciencetostartup.com/paper/ott-vid-optimal-transport-temporal-token-compression-for-video-large-language-models#software", "name": "OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models - Source Code", "description": "OTT-Vid compresses video tokens for LLMs using optimal transport, preserving 95.8% of VQA performance with only 10% of tokens.", "codeRepository": "https://github.com/minseokii/OTT-Vid", "url": "https://github.com/minseokii/OTT-Vid" }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Video LLMs", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "OTT-Vid: Optimal Transport Temporal Token Compression for Vi", "item": "https://sciencetostartup.com/paper/ott-vid-optimal-transport-temporal-token-compression-for-video-large-language-models" } ] } ] }
