ARXIV:2603.12646 · LLM ROUTING · SUBMITTED 19 MAR · 18:48 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Partial · PaperPack: 3 of 4 citation fields filled
Missing · Missing fields: authors
Partial · Proof: unverified proof status

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

arXiv

A high-performance semantic router for LLMs that dramatically reduces latency and memory usage without needing a dedicated GPU.

Blocked on Code › Score 8.0 · Evidence unverified

Opportunity summary

Pain A high-performance semantic router for LLMs that dramatically reduces latency and memory usage without needing a dedicated GPU.

Evidence 0 refs | 0 sources | 33% coverage

Blocker Evidence unverified


PROBLEM

A high-performance semantic router for LLMs that dramatically reduces latency and memory usage without needing a dedicated GPU. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory…
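To make the quadratic term concrete, here is a minimal sketch. It assumes only that dense attention stores an n × n score matrix while a flash-style kernel keeps $O(n)$ working state. The function name `attention_growth` is hypothetical, and the 8K and 32K lengths are the range named in the full abstract. Constant factors (heads, dtype, batch size) cancel in the ratio, so this does not try to reproduce the paper's absolute memory figures.

```python
def attention_growth(tokens_before: int, tokens_after: int) -> tuple[float, float]:
    """Return (quadratic, linear) memory growth factors for a change in sequence length.

    Quadratic: a dense n x n attention score matrix, memory proportional to n**2.
    Linear: a tiled, flash-style kernel whose working state is proportional to n.
    """
    ratio = tokens_after / tokens_before
    return ratio ** 2, ratio


quadratic, linear = attention_growth(8 * 1024, 32 * 1024)
print(f"8K -> 32K tokens: dense attention x{quadratic:.0f}, linear-memory attention x{linear:.0f}")
```

Quadrupling the context multiplies the dense score matrix by sixteen, which is why a router sharing a GPU with vLLM runs out of headroom quickly under standard attention.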

METHOD

Full abstract

System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU, an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K–32K tokens) impossible: at 8K tokens, three concurrent classifiers need ~4.5 GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4,918 ms to 127 ms (38.7×), enabling 8K–32K tokens where SDPA OOMs. Stage 2: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ~512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127 → 62 ms, 2.0×). Stage 3: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62 → 50 ms, 1.2×). Cumulatively: 98× improvement (4,918 ms to 50 ms), 16K-token routing in 108 ms, and a total router GPU footprint under 800 MB, small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage 1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages 2 and 3 are hardware-agnostic.
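The Stage 2 idea, scoring text with classical signals and keeping only what fits a fixed budget, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it works at sentence level, counts whitespace-separated words as a stand-in for model tokens, combines TF-IDF, a position weight, and a novelty penalty, and omits TextRank for brevity. The function name `compress_prompt` and the weighting constants are hypothetical.

```python
import math
import re
from collections import Counter


def compress_prompt(text: str, budget: int = 512) -> str:
    """Greedily keep high-scoring sentences that fit in `budget` whitespace tokens."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences in which each word appears.
    df = Counter(word for bag in bags for word in bag)

    def base_score(i: int) -> float:
        bag = bags[i]
        total = sum(bag.values()) or 1
        tfidf = sum(
            (count / total) * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, count in bag.items()
        )
        # Assumed position weight: first and last sentences count up to 1.5x.
        position = 1.0 + 0.5 * abs(2 * i / (n - 1) - 1) if n > 1 else 1.0
        return tfidf * position

    base = [base_score(i) for i in range(n)]
    chosen: list[int] = []
    seen: set[str] = set()
    used = 0
    remaining = set(range(n))

    def gain(i: int) -> float:
        # Novelty: fraction of this sentence's words not already selected.
        words = set(bags[i])
        novelty = 1.0 - len(words & seen) / len(words) if words else 0.0
        return base[i] * novelty

    while remaining:
        fitting = [i for i in remaining if used + len(sentences[i].split()) <= budget]
        if not fitting:
            break
        best = max(fitting, key=gain)
        chosen.append(best)
        seen |= set(bags[best])
        used += len(sentences[best].split())
        remaining.discard(best)

    # Emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))


if __name__ == "__main__":
    long_prompt = " ".join(f"Sentence number {i} talks about topic{i}." for i in range(200))
    short = compress_prompt(long_prompt, budget=60)
    print(len(long_prompt.split()), "->", len(short.split()), "whitespace tokens")
```

Because the output length is bounded by `budget` regardless of input size, any classifier that runs afterwards sees a constant-size input, which is the property the abstract credits for capping latency and GPU memory.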

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. Stage~1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages~2 and~3 are hardware-agnostic. A public repository is linked, so build verification can…

WHY NOW

LLM Routing moved forward this cycle; last verified April 2026. Public score 8.0/10. Implementation evidence is present through a linked repository.


Opportunity summary

Score 8.0

Pain A high-performance semantic router for LLMs that dramatically reduces latency and memory usage without needing a dedicated GPU.

Evidence 0 refs | 0 sources | 33% coverage

Blocker missing authors


Competitive landscape

A high-performance semantic router for LLMs that dramatically reduces latency and memory usage without needing a dedicated GPU.

Segment

LLM Routing

Adoption evidence

Public code linked for build inspection

Commercial read

8.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


arXiv: https://arxiv.org/abs/2603.12646

Published: 2026-03-13T04:33:53.000Z

Code repository: https://github.com/vllm-project/semantic-router

