ARXIV:2602.03309 · LLM TRAINING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (verified) | PaperPack: 3 of 4 citation fields filled (partial) | Missing fields: authors | Proof: unverified proof status (partial)

Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models

arXiv

Develop an entropy-gated framework for optimizing hybrid training in language models to improve mathematical reasoning performance.

Blocked on Code | Score 5.0 | Evidence unverified

Opportunity summary

Pain Develop an entropy-gated framework for optimizing hybrid training in language models to improve mathematical reasoning performance.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

Develop an entropy-gated framework for optimizing hybrid training in language models to improve mathematical reasoning performance. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with…

METHOD

Full abstract

Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm-up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per-token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy-gated gradient allocation: a predictive entropy module routes high-entropy tokens to full PPO updates to encourage exploration, and low-entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate the advantage function A_t, ensuring that incorrect trajectories receive consistent negative learning signals and preventing reinforcement of confident errors. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD-φ baseline, while incurring only 3.4 percent additional computational overhead.
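The Stage 2 and Stage 3 description above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the gate is computed, so the hard threshold `tau`, the attenuation factor `alpha`, the use of the standard clipped PPO surrogate, and all function names below are assumptions made for illustration only. What the sketch does take from the abstract is that each token's predictive entropy decides between a full and an attenuated PPO update, and that both branches are multiplied by the advantage A_t.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Standard clipped PPO surrogate for one token (to be maximized)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def entropy_gated_loss(logits_per_token, ratios, advantages,
                       tau=1.0, alpha=0.1, clip_eps=0.2):
    """Entropy-gated token-level PPO loss (hypothetical gate: hard threshold).

    tau   : assumed entropy threshold separating high- from low-entropy tokens
    alpha : assumed attenuation weight for low-entropy tokens
    Returns (loss, per-token weights).
    """
    if not logits_per_token:
        raise ValueError("need at least one token")
    total = 0.0
    weights = []
    for logits, ratio, adv in zip(logits_per_token, ratios, advantages):
        h = token_entropy(logits)
        # High entropy -> full PPO update; low entropy -> attenuated update.
        w = 1.0 if h >= tau else alpha
        weights.append(w)
        # Both branches keep the advantage, so a negative A_t still
        # pushes against the sampled token even when the model is confident.
        total += w * ppo_clipped_term(ratio, adv, clip_eps)
    return -total / len(weights), weights


# Usage: one uncertain token (uniform logits) and one confident token,
# both from a trajectory with negative advantage.
loss, weights = entropy_gated_loss(
    logits_per_token=[[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]],
    ratios=[1.0, 1.0],
    advantages=[-1.0, -1.0],
)
print(weights, loss)
```

In this sketch the confident token's update is scaled down rather than dropped, so it still contributes a negative learning signal when the trajectory is wrong. A soft gate, such as a sigmoid of the entropy, would be an equally plausible reading of the abstract.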

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD-φ…

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 5.0/10.


Opportunity summary

Score 5.0

Pain Develop an entropy-gated framework for optimizing hybrid training in language models to improve mathematical reasoning performance.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors


Competitive landscape


Segment

LLM Training

Adoption evidence

No public code link in the paper record yet

Commercial read

5.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified



Published 2026-02-03 · arXiv:2602.03309 · https://arxiv.org/abs/2602.03309