ARXIV:2604.02816 · LLM COMPRESSION · SUBMITTED 06 APR · 20:16 UTC · FRESHNESS UNKNOWN

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

Xinhao Wang · Zhonyu Xia · Zhiwei Lin · Zhe Li · Yongtao Wang · arXiv

A framework for co-optimizing vision token pruning and quantization to enable efficient deployment of multimodal LLMs.

Blocked on Code · Score 4.0 · Evidence unverified

Opportunity summary

Pain A framework for co-optimizing vision token pruning and quantization to enable efficient deployment of multimodal LLMs.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

A framework for co-optimizing vision token pruning and quantization to enable efficient deployment of multimodal LLMs. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent…

METHOD

Full abstract

Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.
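The scoring idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not give the formulas, so everything below is an assumption made for illustration, including the function names, the symmetric 4-bit group-wise quantizer, the peak-to-mean definition of outlier intensity, the min-max normalization, the weights `alpha` and `beta`, the choice that higher sensitivity favors retention, and the 576-token input size.

```python
import numpy as np


def groupwise_quant_error(x, bits=4, group_size=32):
    """Per-token MSE of a simulated symmetric group-wise quantizer (assumed form)."""
    n_tokens, dim = x.shape
    if dim % group_size != 0:
        raise ValueError("hidden size must be divisible by group_size")
    groups = x.reshape(n_tokens, dim // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    # One scale per (token, group), set by the largest magnitude in the group.
    scale = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    dequant = np.clip(np.round(groups / scale), -qmax - 1, qmax) * scale
    return ((groups - dequant) ** 2).mean(axis=(1, 2))


def outlier_intensity(x):
    """Peak-to-mean magnitude ratio per token (a hypothetical outlier measure)."""
    mag = np.abs(x)
    return mag.max(axis=1) / (mag.mean(axis=1) + 1e-8)


def minmax(v):
    """Rescale a score vector to [0, 1]; constant vectors map to zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def select_tokens(x, semantic, keep_ratio=0.125, alpha=0.5, beta=0.5):
    """Return sorted indices of tokens kept under a hybrid score.

    Assumption: tokens with higher sensitivity are treated as more important
    to keep; the abstract does not state the direction or the weighting.
    """
    sensitivity = (beta * minmax(groupwise_quant_error(x))
                   + (1 - beta) * minmax(outlier_intensity(x)))
    score = alpha * minmax(semantic) + (1 - alpha) * sensitivity
    k = max(1, int(round(keep_ratio * x.shape[0])))
    return np.sort(np.argsort(-score)[:k])


# Usage with synthetic data: 576 visual tokens (assumed) with 64 channels each.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64))
semantic_scores = rng.random(576)  # stand-in for attention-based relevance
kept = select_tokens(tokens, semantic_scores)
print(kept.size)  # 72 of 576 tokens at keep_ratio=0.125
```

Setting `alpha=1.0` in this sketch reduces it to purely semantic top-k pruning, which corresponds to the naive integration the abstract argues against.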

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers…

WHY NOW

LLM Compression moved forward this cycle; last verified April 2026. Public score 4.0/10.

Opportunity summary

Score 4.0

Pain A framework for co-optimizing vision token pruning and quantization to enable efficient deployment of multimodal LLMs.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported

Analysis summary

A framework for co-optimizing vision token pruning and quantization to enable efficient deployment of multimodal LLMs.

Competitive landscape

A framework for co-optimizing vision token pruning and quantization to enable efficient deployment of multimodal LLMs.

Segment

LLM Compression

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


Published 2026-04-03T07:32:07.000Z · arXiv https://arxiv.org/abs/2604.02816 · Research domain LLM Compression
