ARXIV:2604.00626 · LLM TRAINING · SUBMITTED 02 APR · 21:07 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song · Mao Zheng · arXiv

A survey providing a unified framework and comprehensive overview of on-policy distillation techniques for large language models, addressing exposure bias.

Blocked on Code · Score 2.0 · Evidence unverified

Opportunity summary

Pain A survey providing a unified framework and comprehensive overview of on-policy distillation techniques for large language models, addressing exposure bias.

Evidence 21 refs | 3 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

A survey providing a unified framework and comprehensive overview of on-policy distillation techniques for large language models, addressing exposure bias. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never…
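Exposure bias matters because small per-token mistakes can accumulate over a long generation. The sketch below is an illustrative calculation, not something taken from the paper. It assumes, purely for illustration, that each token is wrong independently with a fixed probability `eps`; real models do not satisfy this independence, and the function name `p_error_free` is hypothetical.

```python
def p_error_free(eps: float, length: int) -> float:
    """Probability that `length` generated tokens contain no error.

    Assumption (illustrative only): every token is wrong independently
    with the same probability `eps`.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be in [0, 1]")
    if length < 0:
        raise ValueError("length must be non-negative")
    return (1.0 - eps) ** length


if __name__ == "__main__":
    # Same per-token error rate, increasingly long generations.
    for length in (10, 100, 1000):
        print(length, p_error_free(0.01, length))
```

Under this simplified model, the chance of a fully correct output shrinks geometrically with sequence length, which is one way to read the concern about a student that never sees its own mistakes during training.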

METHOD

Full abstract

Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train-test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified f-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
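The loop described in the abstract (the student samples its own trajectory, then the teacher scores the states the student actually visited) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the survey's algorithm: the student and teacher are hypothetical tabular models keyed on the previous token, the divergence is the reverse KL (one member of the f-divergence family, chosen here for illustration), and the setting corresponds to logit-based feedback, a white-box teacher, and a token-level loss. All names and constants (`VOCAB`, `SEQ_LEN`, `opd_step`, `train`) are made up for this example.

```python
import math
import random

VOCAB = 4     # toy vocabulary size (hypothetical)
SEQ_LEN = 6   # tokens generated per student rollout (hypothetical)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(q, p):
    """KL(q || p) for two strictly positive distributions."""
    return sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))


def make_teacher(rng):
    """Fixed teacher: one next-token distribution per previous token."""
    return [softmax([rng.uniform(-2.0, 2.0) for _ in range(VOCAB)])
            for _ in range(VOCAB)]


def rollout(student_logits, rng):
    """Sample a trajectory from the student itself; return visited contexts."""
    prev, visited = 0, []
    for _ in range(SEQ_LEN):
        visited.append(prev)
        q = softmax(student_logits[prev])
        prev = rng.choices(range(VOCAB), weights=q)[0]
    return visited


def opd_step(student_logits, teacher, rng, lr=0.5):
    """One on-policy update; returns the mean reverse KL seen during the step."""
    visited = rollout(student_logits, rng)
    total = 0.0
    for ctx in visited:
        q = softmax(student_logits[ctx])
        p = teacher[ctx]          # teacher feedback on a student-visited state
        kl = reverse_kl(q, p)
        total += kl
        # Exact gradient of KL(q || p) with respect to the student's logits.
        for k in range(VOCAB):
            grad = q[k] * (math.log(q[k]) - math.log(p[k]) - kl)
            student_logits[ctx][k] -= lr * grad / len(visited)
    return total / len(visited)


def train(steps=300, seed=0):
    rng = random.Random(seed)
    teacher = make_teacher(rng)
    student = [[0.0] * VOCAB for _ in range(VOCAB)]  # uniform student at start
    return [opd_step(student, teacher, rng) for _ in range(steps)]


if __name__ == "__main__":
    history = train()
    print(f"first-step KL: {history[0]:.4f}, last-step KL: {history[-1]:.4f}")
```

The point of the sketch is where the training states come from: `rollout` draws them from the student's current distribution, so the teacher's signal is computed on the student's own outputs rather than on a static teacher-generated dataset. Swapping `reverse_kl` and its gradient for another divergence would change the objective while leaving the on-policy sampling intact.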

RESULT

ScienceToStartup currently rates this 2.0/10 on the public viability pass. We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 2.0/10.


Opportunity summary

Score 2.0

Pain A survey providing a unified framework and comprehensive overview of on-policy distillation techniques for large language models, addressing exposure bias.

Evidence 21 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

A survey providing a unified framework and comprehensive overview of on-policy distillation techniques for large language models, addressing exposure bias.


Competitive landscape

A survey providing a unified framework and comprehensive overview of on-policy distillation techniques for large language models, addressing exposure bias.

Segment: LLM Training

Adoption evidence: No public code link in the paper record yet

Commercial read: 2.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


