ARXIV:2603.12707 · MULTIMODAL LLM INFERENCE · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked

Partial · PaperPack: 3 of 4 citation fields filled

Missing · Missing fields: authors

Partial · Proof: unverified proof status

Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

arXiv

HeteroServe optimizes multimodal LLM inference through cost-efficient cross-tier GPU scheduling.

Blocked on Code · Score 3.0 · Evidence unverified

Opportunity summary

Pain: HeteroServe optimizes multimodal LLM inference through cost-efficient cross-tier GPU scheduling.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

HeteroServe optimizes multimodal LLM inference through cost-efficient cross-tier GPU scheduling. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points…

METHOD

Full abstract

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\mathrm{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4×A100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster (\$38k) improves Tokens/\$ by 37% over a homogeneous baseline (\$64k) without degrading latency.
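The gap between GB-scale KV caches and MB-scale embeddings can be checked with back-of-the-envelope arithmetic. The sketch below is not from the paper: the function names are hypothetical, and the model shape (32 layers, 32 KV heads of dimension 128, hidden size 4096, 576 vision tokens, a 2048-token context, fp16) is an illustrative assumption loosely modeled on a 7B-class MLLM. It writes out the hidden-dimension factor that the abstract's $O(L \cdot s_{\mathrm{ctx}})$ notation leaves implicit, and it treats transfer time as payload divided by bandwidth with no latency term.

```python
def kv_cache_bytes(num_layers, ctx_tokens, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of one sequence's KV cache: keys and values for every layer."""
    return 2 * num_layers * ctx_tokens * num_kv_heads * head_dim * bytes_per_elem


def vision_embedding_bytes(num_vision_tokens, hidden_dim, bytes_per_elem=2):
    """Size of the vision embeddings handed to the language model."""
    return num_vision_tokens * hidden_dim * bytes_per_elem


def transfer_seconds(num_bytes, link_bytes_per_second):
    """Idealized transfer time: payload / bandwidth, ignoring latency and overlap."""
    return num_bytes / link_bytes_per_second


# Illustrative assumptions, not values reported in the paper.
NUM_LAYERS = 32          # L, transformer depth
CTX_TOKENS = 2048        # s_ctx, tokens held in the KV cache
NUM_KV_HEADS = 32        # equals the query head count under MHA; smaller under GQA
HEAD_DIM = 128
HIDDEN_DIM = 4096        # d
NUM_VISION_TOKENS = 576  # N_v
PCIE_BYTES_PER_S = 32e9  # assumed nominal PCIe 4.0 x16 bandwidth, one direction

kv_bytes = kv_cache_bytes(NUM_LAYERS, CTX_TOKENS, NUM_KV_HEADS, HEAD_DIM)
emb_bytes = vision_embedding_bytes(NUM_VISION_TOKENS, HIDDEN_DIM)
ratio = kv_bytes / emb_bytes

print(f"KV cache (stage-level split):      {kv_bytes / 2**30:.2f} GiB")
print(f"Vision embeddings (modality split): {emb_bytes / 2**20:.2f} MiB")
print(f"Reduction factor:                   {ratio:.1f}x")
print(f"KV cache over assumed PCIe link:    {transfer_seconds(kv_bytes, PCIE_BYTES_PER_S) * 1e3:.2f} ms")
print(f"Embeddings over assumed PCIe link:  {transfer_seconds(emb_bytes, PCIE_BYTES_PER_S) * 1e3:.3f} ms")
```

Under these assumptions the KV cache is exactly 1 GiB and the embeddings are 4.5 MiB. With MHA the ratio simplifies to 2 · L · s_ctx / N_v, so it grows linearly with depth, in line with the abstract's $O(L)$ claim. Lowering `NUM_KV_HEADS` to model GQA shrinks the KV cache by the same factor but leaves the dependence on depth unchanged.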

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points…

WHY NOW

Multimodal LLM Inference moved forward this cycle; last verified April 2026. Public score 3.0/10.

Opportunity summary

Score: 3.0

Pain: HeteroServe optimizes multimodal LLM inference through cost-efficient cross-tier GPU scheduling.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: missing authors

Competitive landscape

HeteroServe optimizes multimodal LLM inference through cost-efficient cross-tier GPU scheduling.

Segment: Multimodal LLM Inference

Adoption evidence: No public code link in the paper record yet

Commercial read: 3.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

arXiv: https://arxiv.org/abs/2603.12707 · Published 2026-03-13 06:42:35 UTC · Page: https://sciencetostartup.com/paper/cost-efficient-multimodal-llm-inference-via-cross-tier-gpu-heterogeneity
