ARXIV:2604.01621 · LLM INFERENCE OPTIMIZATION · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

Wanqian Li · Jintao Peng · Zongfei Jing · Tianyu Zhang · Ze Long · Xianjie Qiao · +4 at arXiv

A novel inference parallelization strategy for LLMs that improves performance by offloading MoE weights and enabling independent GPU execution.

Blocked on: Code · Score: 4.0 · Evidence: unverified

Opportunity summary

Pain: A novel inference parallelization strategy for LLMs that improves performance by offloading MoE weights and enabling independent GPU execution.

Evidence: 0 refs | 0 sources | 33% coverage

Blocker: Evidence unverified

PROBLEM

Existing multi-GPU inference parallelization strategies require layer-wise inter-rank synchronization, which makes end-to-end performance sensitive to workload imbalance. DWDP (Distributed Weight Data Parallelism) addresses this by preserving data-parallel execution while offloading MoE weights across peer GPUs, so that each GPU can progress independently.

METHOD

Full abstract

Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
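The mechanism the abstract describes (each rank holds only part of the MoE expert weights and reads missing experts from a peer on demand, with no collective step) can be pictured with a toy simulation. This is a minimal sketch under stated assumptions, not the TensorRT-LLM implementation: the `Rank` class, the `owner_of` round-robin placement rule, the LRU cache, and the string stand-ins for weights are all hypothetical names introduced for illustration. The paper's split-weight management and asynchronous prefetch optimizations are not modeled here.

```python
from collections import OrderedDict


def owner_of(expert_id: int, num_ranks: int) -> int:
    """Hypothetical placement rule: experts are striped round-robin over ranks."""
    return expert_id % num_ranks


class Rank:
    """One data-parallel rank that stores only its own shard of expert weights."""

    def __init__(self, rank_id: int, num_ranks: int, num_experts: int, cache_slots: int = 2):
        self.rank_id = rank_id
        self.num_ranks = num_ranks
        # Strings stand in for real weight tensors.
        self.local = {
            e: f"weights[{e}]"
            for e in range(num_experts)
            if owner_of(e, num_ranks) == rank_id
        }
        self.cache = OrderedDict()  # assumed LRU cache for remotely fetched experts
        self.cache_slots = cache_slots
        self.remote_fetches = 0

    def get_expert(self, expert_id: int, peers: list) -> str:
        """Return expert weights, reading from the owning peer only when needed."""
        if expert_id in self.local:
            return self.local[expert_id]
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        # One-sided read from the owner: no other rank has to participate.
        owner = peers[owner_of(expert_id, self.num_ranks)]
        weights = owner.local[expert_id]
        self.remote_fetches += 1
        self.cache[expert_id] = weights
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)  # evict least recently used entry
        return weights

    def run_layer(self, routed_experts: list, peers: list) -> list:
        """Resolve the experts chosen by the router for this rank's own batch."""
        return [self.get_expert(e, peers) for e in routed_experts]


ranks = [Rank(r, num_ranks=4, num_experts=8) for r in range(4)]
# Rank 0 owns experts 0 and 4; expert 5 lives on rank 1 and is fetched once.
out = ranks[0].run_layer([0, 5, 4, 5], ranks)
print(out, ranks[0].remote_fetches)
```

In this toy model rank 0 never waits on a barrier: it only touches the single peer that owns a missing expert, which is the property the abstract attributes to removing collective inter-rank synchronization.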

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
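To put the reported 8.8% TPS/GPU gain in capacity terms, the short calculation below converts a per-GPU throughput uplift into the number of GPUs needed to serve the same aggregate output throughput. It is a back-of-the-envelope sketch only: it assumes aggregate throughput scales linearly with GPU count and that the uplift holds uniformly, neither of which the source states. The function names and the 72-GPU baseline are hypothetical choices for illustration.

```python
import math

UPLIFT = 0.088  # reported end-to-end output TPS/GPU improvement


def gpu_savings_fraction(uplift: float = UPLIFT) -> float:
    """Fraction of GPUs saved at equal aggregate throughput (linear-scaling assumption)."""
    return 1.0 - 1.0 / (1.0 + uplift)


def gpus_for_same_throughput(baseline_gpus: int, uplift: float = UPLIFT) -> int:
    """Whole GPUs needed to match the baseline fleet's aggregate output TPS."""
    return math.ceil(baseline_gpus / (1.0 + uplift))


print(f"{gpu_savings_fraction():.3f}")   # about 0.081
print(gpus_for_same_throughput(72))      # hypothetical 72-GPU baseline
```

Under these assumptions an 8.8% per-GPU gain corresponds to roughly 8.1% fewer GPUs for the same output throughput, since the saving is 1 - 1/1.088 rather than 8.8% itself.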

WHY NOW

LLM Inference Optimization moved forward this cycle; last verified April 2026. Public score 4.0/10.

Competitive landscape

Segment

LLM Inference Optimization

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Authors

Wanqian Li · Jintao Peng · Zongfei Jing · Tianyu Zhang · Ze Long · Xianjie Qiao · Xiaoming Chen · Dongxu Yang · Kefeng Duan · June Yang

Published

2026-04-02 05:00:08 UTC

Source

https://arxiv.org/abs/2604.01621
