ARXIV:2605.21427 · LLM SERVING · SUBMITTED 21 MAY · 20:28 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Can Hankendi · Rana Shahout · Minlan Yu · Ayse K. Coskun · arXiv

PALS is a power-aware runtime for LLM serving that optimizes GPU power caps with software parameters to improve energy efficiency and QoS.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain PALS is a power-aware runtime for LLM serving that optimizes GPU power caps with software parameters to improve energy efficiency and QoS.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

PALS is a power-aware runtime for LLM serving that optimizes GPU power caps with software parameters to improve energy efficiency and QoS. While prior systems optimize throughput and latency by batching, scheduling, and parallelism,…

METHOD

Full abstract

Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.
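The abstract describes the control loop only at a high level, so the following is a minimal sketch under stated assumptions rather than the PALS implementation. It assumes the offline power-performance model is a lookup table from (power cap, batch size) to predicted throughput and power, that energy efficiency is measured as tokens per joule, and that feedback is a single multiplicative correction on predicted throughput. All names (`Prediction`, `MODEL`, `select_config`, `update_correction`) and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical configuration: (GPU power cap in watts, batch size).
Config = Tuple[int, int]


@dataclass(frozen=True)
class Prediction:
    throughput: float  # predicted tokens per second
    power: float       # predicted average power draw in watts


# Made-up offline profile, for illustration only (not data from the paper).
MODEL: Dict[Config, Prediction] = {
    (200, 8): Prediction(throughput=900.0, power=180.0),
    (200, 16): Prediction(throughput=1300.0, power=195.0),
    (300, 16): Prediction(throughput=1700.0, power=280.0),
    (300, 32): Prediction(throughput=2200.0, power=295.0),
    (400, 32): Prediction(throughput=2500.0, power=390.0),
}


def select_config(
    model: Dict[Config, Prediction],
    target_tps: float,
    budget_w: float,
    correction: float = 1.0,
) -> Optional[Config]:
    """Return the config with the best predicted tokens per joule among
    those that fit the power budget and meet the throughput target.
    Returns None when no config is feasible."""
    best: Optional[Config] = None
    best_eff = -1.0
    for (cap_w, batch), pred in model.items():
        corrected_tps = pred.throughput * correction
        if cap_w > budget_w or corrected_tps < target_tps:
            continue  # violates the power budget or the throughput target
        eff = corrected_tps / pred.power  # tokens/s over watts = tokens/joule
        if eff > best_eff:
            best, best_eff = (cap_w, batch), eff
    return best


def update_correction(
    correction: float,
    predicted_tps: float,
    measured_tps: float,
    gain: float = 0.5,
) -> float:
    """Feedback step (assumed form): move the correction factor part of
    the way toward the observed measured/predicted throughput ratio."""
    ratio = measured_tps / predicted_tps
    return correction + gain * (ratio - correction)
```

With this toy table, `select_config(MODEL, target_tps=1200, budget_w=400)` picks `(300, 32)`, and tightening the budget to 250 W picks `(200, 16)`. If the measured throughput then comes in below the prediction, `update_correction` lowers the correction factor and the next selection may find no feasible configuration, which is the situation in which a real controller would have to relax the target or request more power. In a real deployment the chosen cap would be applied through the GPU driver and the batch size through the serving framework; neither step is shown here.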

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to…

WHY NOW

LLM Serving moved forward this cycle; last verified May 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain PALS is a power-aware runtime for LLM serving that optimizes GPU power caps with software parameters to improve energy efficiency and QoS.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

PALS is a power-aware runtime for LLM serving that optimizes GPU power caps with software parameters to improve energy efficiency and QoS.

Competitive landscape

PALS is a power-aware runtime for LLM serving that optimizes GPU power caps with software parameters to improve energy efficiency and QoS.

Segment

LLM Serving

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified
