ARXIV:2603.02731 · MOE MODEL TRAINING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Partial · PaperPack: 3 of 4 citation fields filled
Missing · Missing fields: authors
Partial · Proof: unverified proof status

Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

arXiv

Optimize large-scale MoE model training on Hopper GPUs by implementing FP4 without native support.

Blocked on Code · Score 3.0 · Evidence unverified

Opportunity summary

Pain: Optimize large-scale MoE model training on Hopper GPUs by implementing FP4 without native support.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

Optimize large-scale MoE model training on Hopper GPUs by implementing FP4 without native support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native…

METHOD

Full abstract

Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4 or NVFP4 support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native 4-bit computation support. A central challenge is to integrate FP4 into an existing BF16/FP8 hybrid training pipeline without incurring costly precision round-trips (e.g., FP4 ↔ BF16 ↔ FP8). We address this challenge by introducing direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication with minimal overhead. Core MoE computations are executed in FP8, while activations and expert-parallel communication are compressed using MXFP4, achieving substantial memory and bandwidth savings without degrading convergence. At the 671B parameter scale, our method achieves end-to-end training performance comparable to strong FP8 baselines, while reducing peak activation memory by 14.8% (11.8 GB) and improving training throughput by 12.5%, from 1157 to 1302 tokens per GPU per second. These results show that FP4 efficiency can be practically realized for large-scale MoE training through careful software-hardware co-design, even without native FP4 Tensor Core support.
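For readers unfamiliar with the MXFP4 layout that the abstract relies on, the following is a minimal sketch of block-wise MXFP4 quantization in plain Python. It is not the authors' kernel: the paper describes a direct FP8-to-FP4 path on the GPU, whereas this sketch quantizes ordinary Python floats and assumes the commonly described MX layout (32 elements per block, one shared power-of-two scale, E2M1 element values). The function names `quantize_block`, `dequantize_block`, and `bits_per_element` are hypothetical, tie-breaking is simplified, and the shared exponent is not clamped to the 8-bit scale range.

```python
import math

# Magnitudes an E2M1 (FP4) element can represent in the MX format (assumed layout).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2   # largest power of two in the grid is 4 = 2**2
BLOCK = 32      # MXFP4 shares one power-of-two scale across 32 elements


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of signed E2M1 values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at the largest value, 6
        # Nearest grid point; ties go to the smaller magnitude (a simplification).
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate values from the shared exponent and E2M1 codes."""
    return [c * 2.0 ** shared_exp for c in codes]


def bits_per_element(block=BLOCK, elem_bits=4, scale_bits=8):
    """Average storage cost per element, including the shared 8-bit scale."""
    return (block * elem_bits + scale_bits) / block


exp, codes = quantize_block([1.0, -3.0, 6.0, 0.4])
print(exp, codes)                    # 0 [1.0, -3.0, 6.0, 0.5]
print(dequantize_block(exp, codes))  # [1.0, -3.0, 6.0, 0.5]
print(bits_per_element())            # 4.25
```

Under these assumptions each element costs 4.25 bits on average, which is where the memory and bandwidth savings over FP8 activations come from. The sketch does not model the scaling-aware row-wise to column-wise conversion mentioned in the abstract.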

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4…

WHY NOW

MoE Model Training moved forward this cycle; last verified April 2026. Public score 3.0/10.

Competitive landscape

Segment: MoE Model Training
Adoption evidence: No public code link in the paper record yet
Commercial read: 3.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Published 2026-03-03 08:29 UTC · arXiv: https://arxiv.org/abs/2603.02731 · Research domain: MoE Model Training
