ARXIV:2605.07355 · VIDEO-LANGUAGE MODELS · SUBMITTED 11 MAY · 20:36 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

TTF: Temporal Token Fusion for Efficient Video-Language Model

Simin Huo · Ning LI · arXiv

A training-free, plug-and-play framework that fuses temporal tokens in videos to significantly reduce inference costs for video-language models without sacrificing accuracy.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A training-free, plug-and-play framework that fuses temporal tokens in videos to significantly reduce inference costs for video-language models without sacrificing accuracy.

Evidence: 0 refs | 4 sources | 83% coverage

Blocker: Evidence unverified

PROBLEM

A training-free, plug-and-play framework that fuses temporal tokens in videos to significantly reduce inference costs for video-language models without sacrificing accuracy. For example, 32 frames at $448{\times}448$ resolution already yield >8,000 visual tokens in…

METHOD

Full abstract

Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at $448{\times}448$ resolution already yield >8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then, for each subsequent frame, performs a local window similarity search (e.g., $3\times 3$), fusing tokens whose similarity exceeds a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold $t=0.70$, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only ${\approx}0.16$ GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at https://github.com/Cominder/ttf
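The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it rests on several assumptions made only for illustration: the anchor is simply frame 0 (the paper selects it automatically), similarity is cosine similarity, matched tokens are fused by averaging into the anchor token, and the function name `fuse_temporal_tokens` and its parameters are hypothetical. The returned `(t, row, col)` coordinates are only a stand-in for the coordinate realignment the abstract mentions.

```python
import numpy as np


def fuse_temporal_tokens(frames, threshold=0.70, window=3):
    """Sketch of local-window temporal token fusion (hypothetical helper).

    frames: float array of shape (T, H, W, D), one D-dim token per grid cell.
    Returns (tokens, coords): kept tokens (N, D) and their (t, row, col) origins.
    """
    T, H, W, D = frames.shape
    r = window // 2

    # Assumption: frame 0 is the anchor; the paper selects the anchor automatically.
    anchor = frames[0].astype(np.float64)
    anchor_sum = anchor.copy()          # running sum of tokens fused into each cell
    anchor_cnt = np.ones((H, W))        # how many tokens each anchor cell absorbed
    anchor_unit = anchor / (np.linalg.norm(anchor, axis=-1, keepdims=True) + 1e-8)

    kept_tokens, kept_coords = [], []
    for t in range(1, T):
        for i in range(H):
            for j in range(W):
                tok = frames[t, i, j]
                tok_unit = tok / (np.linalg.norm(tok) + 1e-8)
                # Local search window around (i, j), clipped at the grid border.
                i0, i1 = max(0, i - r), min(H, i + r + 1)
                j0, j1 = max(0, j - r), min(W, j + r + 1)
                sims = anchor_unit[i0:i1, j0:j1] @ tok_unit  # cosine similarities
                k = np.unravel_index(np.argmax(sims), sims.shape)
                if sims[k] > threshold:
                    # Redundant token: fuse it into the best-matching anchor cell.
                    anchor_sum[i0 + k[0], j0 + k[1]] += tok
                    anchor_cnt[i0 + k[0], j0 + k[1]] += 1
                else:
                    # Novel token: keep it, remembering where it came from.
                    kept_tokens.append(tok)
                    kept_coords.append((t, i, j))

    fused_anchor = (anchor_sum / anchor_cnt[..., None]).reshape(-1, D)
    anchor_coords = [(0, i, j) for i in range(H) for j in range(W)]
    tokens = np.concatenate([fused_anchor, np.array(kept_tokens).reshape(-1, D)])
    return tokens, anchor_coords + kept_coords
```

For a perfectly static clip (every frame identical), each later-frame token matches the anchor cell at its own position, so only the anchor's `H * W` tokens remain. The explicit triple loop is chosen for readability; a practical version would vectorize the window search.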

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. The code is available at https://github.com/Cominder/ttf. A public repository is linked, so build verification can inspect implementation evidence instead of treating the paper as…

WHY NOW

Video-Language Models moved forward this cycle; last verified May 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 7.0

Pain: A training-free, plug-and-play framework that fuses temporal tokens in videos to significantly reduce inference costs for video-language models without sacrificing accuracy.

Evidence: 0 refs | 4 sources | 83% coverage

Blocker: no shell-level blocker reported

Competitive landscape

A training-free, plug-and-play framework that fuses temporal tokens in videos to significantly reduce inference costs for video-language models without sacrificing accuracy.

Segment: Video-Language Models

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified
