Abstract
On-Policy Representation Distillation (OPRD) improves upon traditional on-policy distillation by aligning student and teacher representations in hidden-state space rather than just output space, resulting in reduced variance and improved training efficiency.
On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
Community
Due to the company's open-source process, the code will be released within the next week~
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- On-Policy Distillation with Best-of-N Teacher Rollout Selection (2026)
- KL for a KL: On-Policy Distillation with Control Variate Baseline (2026)
- Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning (2026)
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe (2026)
- OISD: On-Policy Internal Self-Distillation of Language Models (2026)
- The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes (2026)
- Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2606.06021 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper