ARXIV:2605.30571 · LLM INFERENCE OPTIMIZATION · SUBMITTED 01 JUN · 20:26 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Josef Chen · arXiv

Optimizing batch-1 LLM decode for physical AI systems by identifying and mitigating launch-side overheads that limit performance on high-bandwidth GPUs.

Blocked on Code · Score 5.0 · Evidence unverified

Opportunity summary

Pain Optimizing batch-1 LLM decode for physical AI systems by identifying and mitigating launch-side overheads that limit performance on high-bandwidth GPUs.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

Batch-1 LLM decode for physical AI systems is usually described as memory-bandwidth-bound, but launch-side overheads limit its performance on high-bandwidth GPUs. The paper identifies these overheads and tests a mitigation.
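One way to make the memory-bandwidth-bound description concrete is an analytic floor: the bytes a decode step must stream, divided by peak memory bandwidth. The sketch below is a minimal illustration, not the paper's own accounting. The model shape (7.6e9 parameters, 28 layers, 4 KV heads, head dimension 128) and the peak-bandwidth figures (300 GB/s and 3350 GB/s) are assumptions chosen for illustration, and `ModelShape`, `bytes_per_decode_step`, `memory_floor_ms` and `achieved_fraction` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Illustrative GQA transformer shape (assumed, not taken from the paper)."""
    params: float             # total parameter count
    layers: int               # number of transformer layers
    kv_heads: int             # number of key/value heads (GQA)
    head_dim: int             # dimension of each head
    bytes_per_value: int = 2  # bf16 stores each value in 2 bytes


def bytes_per_decode_step(m: ModelShape, ctx: int) -> float:
    """Bytes streamed per token: all weights plus the active KV cache."""
    weight_bytes = m.params * m.bytes_per_value
    # Keys and values (factor 2) for every layer, KV head and cached position.
    kv_bytes = 2 * m.layers * m.kv_heads * m.head_dim * ctx * m.bytes_per_value
    return weight_bytes + kv_bytes


def memory_floor_ms(m: ModelShape, ctx: int, peak_bw_gb_s: float) -> float:
    """Lower bound on step latency if memory ran at peak bandwidth."""
    return bytes_per_decode_step(m, ctx) / (peak_bw_gb_s * 1e9) * 1e3


def achieved_fraction(floor_ms: float, measured_ms: float) -> float:
    """Fraction of the analytic floor that a measured latency reaches."""
    return floor_ms / measured_ms


if __name__ == "__main__":
    shape = ModelShape(params=7.6e9, layers=28, kv_heads=4, head_dim=128)
    # Assumed peak bandwidths in GB/s; substitute datasheet values as needed.
    for gpu, bw in {"slow GPU (300 GB/s)": 300.0, "fast GPU (3350 GB/s)": 3350.0}.items():
        print(f"{gpu}: floor = {memory_floor_ms(shape, 2048, bw):.2f} ms/step")
```

Dividing the floor by a measured step latency gives the achieved fraction. If latency were set by bandwidth alone, that fraction would stay roughly constant as peak bandwidth rises.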

METHOD

Full abstract

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
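The quantisation and CUDA Graphs figures in the abstract can be turned into ratios with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract; the function names are hypothetical. Reading a speedup s as removing 1 - 1/s of the step time assumes the whole gain comes from eliminated overhead, which is an interpretive assumption rather than a claim from the paper.

```python
# L4 decode latencies quoted in the abstract, in ms per step.
BF16_BASELINE_MS = 62.32
QUANT_PATHS_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}

# CUDA Graphs speedups quoted in the abstract at ctx=2048.
CUDA_GRAPHS_SPEEDUP = {"H100": 1.259, "L4": 1.028}


def speedup(baseline_ms: float, measured_ms: float) -> float:
    """Latency ratio of the baseline to the measured path."""
    return baseline_ms / measured_ms


def removed_fraction(speedup_factor: float) -> float:
    """Share of the original step time eliminated by a given speedup."""
    return 1.0 - 1.0 / speedup_factor


def summarize():
    """Return (quantisation speedups, step-time fractions removed by CUDA Graphs)."""
    quant = {name: speedup(BF16_BASELINE_MS, ms) for name, ms in QUANT_PATHS_MS.items()}
    graphs = {gpu: removed_fraction(s) for gpu, s in CUDA_GRAPHS_SPEEDUP.items()}
    return quant, graphs


if __name__ == "__main__":
    quant, graphs = summarize()
    for name, s in quant.items():
        print(f"{name}: {s:.2f}x vs bf16 (nominal weight-traffic reduction: 4x)")
    for gpu, f in graphs.items():
        print(f"{gpu}: CUDA Graphs removed {f:.1%} of step time")
```

With the quoted figures, the quantised paths come out at about 1.05x, 1.38x and 3.59x over bf16 on L4, against a nominal 4x reduction in weight traffic. Under the stated assumption, the CUDA Graphs speedups correspond to roughly 20.6 percent of step time on H100 and 2.7 percent on L4.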

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. The paper shows that the memory-bandwidth-bound account of batch-1 decode is true but incomplete.

WHY NOW

LLM Inference Optimization moved forward this cycle; last verified June 2026. Public score 5.0/10.

Opportunity summary

Score 5.0

Pain Optimizing batch-1 LLM decode for physical AI systems by identifying and mitigating launch-side overheads that limit performance on high-bandwidth GPUs.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

Optimizing batch-1 LLM decode for physical AI systems by identifying and mitigating launch-side overheads that limit performance on high-bandwidth GPUs.

Competitive landscape

Optimizing batch-1 LLM decode for physical AI systems by identifying and mitigating launch-side overheads that limit performance on high-bandwidth GPUs.

Segment

LLM Inference Optimization

Adoption evidence

No public code link in the paper record yet

Commercial read

5.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Published 2026-05-28 · arXiv: https://arxiv.org/abs/2605.30571 · Page: https://sciencetostartup.com/paper/memory-bound-but-not-bandwidth-limited-the-physical-ai-inference-gap-in-batch-1-llm-decode
