ARXIV:2604.19145 · VISION-LANGUAGE MODELS · SUBMITTED 22 APR · 02:14 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

Lin Sha · Haiyun Guo · Tao Wang · Cong Zhang · Min Huang · Jinqiao Wang · Qinghai Miao

A training-free, plug-and-play framework for spatio-temporal token pruning in vision-language models for autonomous driving, achieving near-lossless performance at 90% reduction.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain A training-free, plug-and-play framework for spatio-temporal token pruning in vision-language models for autonomous driving, achieving near-lossless performance at 90% reduction.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

A training-free, plug-and-play framework for spatio-temporal token pruning in vision-language models for autonomous driving, achieving near-lossless performance at 90% reduction. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view…

METHOD

Full abstract

Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
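The abstract gives no formulas, so the following is only a minimal sketch of how the two ideas could look in code, not the authors' implementation. It assumes cosine similarity on token features, a motion score computed against the same grid position in the previous frame, a linear recency ramp, and a "bilateral" penalty that averages a token's best match in the left and right neighbouring cameras. The function names (temporal_prior, ring_view_penalty, greedy_diverse_select), the weights alpha, beta, gamma, and the toy tensor shapes are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each feature vector to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def temporal_prior(tokens, alpha=1.0, beta=1.0):
    """Assumed MTP-style soft prior for tokens shaped (T, N, D).

    Motion volatility is approximated as 1 - cosine similarity to the token at
    the same grid position in the previous frame; recency rises linearly with
    the frame index. Both are illustrative choices, not taken from the paper.
    """
    T, N, _ = tokens.shape
    unit = l2_normalize(tokens)
    motion = np.zeros((T, N))
    motion[1:] = 1.0 - np.sum(unit[1:] * unit[:-1], axis=-1)
    recency = np.linspace(0.0, 1.0, T)[:, None] if T > 1 else np.ones((1, 1))
    return (alpha * motion + beta * recency).reshape(-1)


def ring_view_penalty(view_tokens, gamma=1.0):
    """Assumed RSP-style penalty for tokens shaped (V, N, D), cameras in ring order.

    Each token is penalised by the mean of its best cosine match in the left
    and right neighbouring views (indices wrap around the ring).
    """
    unit = l2_normalize(view_tokens)
    V = unit.shape[0]
    penalty = np.zeros(unit.shape[:2])
    for v in range(V):
        left, right = unit[(v - 1) % V], unit[(v + 1) % V]
        best_left = (unit[v] @ left.T).max(axis=1)
        best_right = (unit[v] @ right.T).max(axis=1)
        penalty[v] = 0.5 * (best_left + best_right)
    return gamma * penalty.reshape(-1)


def greedy_diverse_select(features, prior, keep):
    """Greedily pick `keep` rows, trading the prior against redundancy.

    Each step takes the unselected token with the highest
    (prior - max cosine similarity to the already selected set).
    """
    unit = l2_normalize(features)
    max_sim = np.zeros(len(unit))            # redundancy w.r.t. selected tokens
    available = np.ones(len(unit), dtype=bool)
    chosen = []
    for _ in range(min(keep, len(unit))):
        score = np.where(available, prior - max_sim, -np.inf)
        idx = int(np.argmax(score))
        chosen.append(idx)
        available[idx] = False
        max_sim = np.maximum(max_sim, unit @ unit[idx])
    return np.array(sorted(chosen))


# Toy data only: 4 frames x 50 tokens x 16 dims for one camera.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 50, 16))
prior = temporal_prior(frames)
keep = int(round(0.10 * prior.size))         # 90% reduction keeps 20 of 200 tokens
kept_temporal = greedy_diverse_select(frames.reshape(-1, 16), prior, keep)

# Toy data only: 6 ring cameras x 50 tokens x 16 dims for one frame.
views = rng.normal(size=(6, 50, 16))
kept_spatial = greedy_diverse_select(views.reshape(-1, 16), -ring_view_penalty(views), 30)
```

Because both signals enter as additive terms in the selection score rather than as hard masks, a static or duplicated token can still be kept if nothing more informative is available, which is one reading of "soft constraints" in the abstract. How the paper actually combines the temporal and spatial stages is not stated in this record, so the sketch runs them separately.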

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing…

WHY NOW

Vision-Language Models moved forward this cycle; last verified April 2026. Public score 8.0/10. Production flags indicate code availability.


Shell-level blocker: none reported


Competitive landscape

A training-free, plug-and-play framework for spatio-temporal token pruning in vision-language models for autonomous driving, achieving near-lossless performance at 90% reduction.

Segment: Vision-Language Models

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Published: 2026-04-21T06:51:08.000Z · arXiv: https://arxiv.org/abs/2604.19145 · Page: https://sciencetostartup.com/paper/st-prune-training-free-spatio-temporal-token-pruning-for-vision-language-models-in-autonomous-driving

Authors: Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao

Viability score: 8 · Research domain: Vision-Language Models · Commercial readiness: code


